DataCamp offer interactive courses related to R Programming. While some is review, it is helpful to see other perspectives on material. As well, DataCamp has some interesting materials on packages that I want to learn better (ggplot2, dplyr, ggvis, etc.). This document summarizes a few key insights from:
This document is currently split between _v003 and _v003_a and _v003_b due to the need to keep the number of DLL that it opens below the hard-coded maximum. This introductory section needs to be re-written, and the contents consolidated, at a future date.
The original DataCamp_Insights_v001 and DataCamp_Insights_v002 documents have been split for this document:
Chapter 1 - Introduction
Problems in spatial statistics:
Simulation and testing with spatstat:
Further testing:
Example code includes:
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(ggplot2)
# The number of points to create
n <- 200
# Set the range
xmin <- 0
xmax <- 1
ymin <- 0
ymax <- 2
# Sample from a Uniform distribution
x <- runif(n, xmin, xmax)
y <- runif(n, ymin, ymax)
# The ratio of the Y axis scale to the X axis scale is called the aspect ratio of the plot. Spatial data should always be presented with an aspect ratio of 1:1.
# See pre-defined variables
# ls.str()
# Plot points and a rectangle
mapxy <- function(a = NA){
plot(x, y, asp = a)
rect(xmin, ymin, xmax, ymax)
}
mapxy(1)
# How do we create a uniform density point pattern in a circle?
# We might first try selecting radius and angle uniformly. But that produces a "cluster" at small distances
# Instead we sample the radius from a non-uniform distribution that scales linearly with distance, so we have fewer points at small radii and more at large radii
# This exercise uses spatstat's disc() function, that creates a circular window.
# Load the spatstat package
library(spatstat)
## Loading required package: spatstat.data
## Loading required package: nlme
##
## Attaching package: 'nlme'
## The following object is masked from 'package:dplyr':
##
## collapse
## Loading required package: rpart
##
## spatstat 1.55-0 (nickname: 'Stunned Mullet')
## For an introduction to spatstat, type 'beginner'
##
## Note: R version 3.3.3 (2017-03-06) is more than 9 months old; we strongly recommend upgrading to the latest version
# Create this many points, in a circle of this radius
n_points <- 300
radius <- 10
# Generate uniform random numbers up to radius-squared
r_squared <- runif(n_points, 0, radius**2)
angle <- runif(n_points, 0, 2*pi)
# Take the square root of the values to get a uniform spatial distribution
x <- sqrt(r_squared) * cos(angle)
y <- sqrt(r_squared) * sin(angle)
plot(spatstat::disc(radius))
points(x, y)
# Some variables have been pre-defined
# ls.str()
# Set coordinates and window
ppxy <- ppp(x = x, y = y, window = disc(radius))
# Test the point pattern
qt <- quadrat.test(ppxy)
## Warning: Some expected counts are small; chi^2 approximation may be
## inaccurate
# Inspect the results
plot(qt)
print(qt)
##
## Chi-squared test of CSR using quadrat counts
## Pearson X2 statistic
##
## data: ppxy
## X2 = 25.459, df = 24, p-value = 0.7624
## alternative hypothesis: two.sided
##
## Quadrats: 25 tiles (irregular windows)
# In the previous exercise you used a set of 300 events scattered uniformly within a circle
# If you repeated the generation of the events again you will still have 300 of them, but in different locations
# The dataset of exactly 300 points is from a Poisson point process conditioned on the total being 300
# The spatstat package can generate Poisson spatial processes with the rpoispp() function given an intensity and a window, that are not conditioned on the total
# Just as the random number generator functions in R start with an "r", most of the random point-pattern functions in spatstat start with an "r".
# The area() function of spatstat will compute the area of a window such as a disc
# Create a disc of radius 10
disc10 <- disc(10)
# Compute the rate as count divided by area
lambda <- 500 / area(disc10)
# Create a point pattern object
ppois <- rpoispp(lambda = lambda, win = disc10)
# Plot the Poisson point pattern
plot(ppois)
# The spatstat package also has functions for generating point patterns from other process modelsparameters.
# These generally fall into one of two classes: clustered processes, where points occur together more than under a uniform Poisson process,
# and regular (aka inhibitory) processes where points are more spaced apart than under a uniform intensity Poisson process
# Some process models can generate patterns on a continuum from clustered through uniform to regular depending on their parameters
# The quadrat.test() function can test against clustered or regular alternative hypotheses
# By default it tests against either of those, but this can be changed with the alternative parameter to create a one-sided test.
# A Thomas process is a clustered pattern where a number of "parent" points, uniformly distributed, create a number of "child" points in their neighborhood
# The child points themselves form the pattern. This is an attractive point pattern, and makes sense for modelling things like trees, since new trees will grow near the original tree
# Random Thomas point patterns can be generated using rThomas()
# This takes three numbers that determine the intensity and clustering of the points, and a window object.
# Conversely the points of a Strauss process cause a lowering in the probability of finding another point nearby
# The parameters of a Strauss process can be such that it is a "hard-core" process, where no two points can be closer than a set threshold
# Creating points from this process involves some clever simulation algorithms
# This is a repulsive point pattern, and makes sense for modelling things like territorial animals, since the other animals of that species will avoid the territory of a given animal
# Random Strauss point patterns can be generated using rStrauss()
# This takes three numbers that determine the intensity and "territory" of the points, and a window object
# Points generated by a Strauss process are sometimes called regularly spaced.
# Create a disc of radius 10
disc10 <- disc(10)
# Generate clustered points from a Thomas process
set.seed(123)
p_cluster <- rThomas(kappa = 0.35, scale = 1, mu = 3, win = disc10)
plot(p_cluster)
# Run a quadrat test
quadrat.test(p_cluster, alternative = "clustered")
## Warning: Some expected counts are small; chi^2 approximation may be
## inaccurate
##
## Chi-squared test of CSR using quadrat counts
## Pearson X2 statistic
##
## data: p_cluster
## X2 = 53.387, df = 24, p-value = 0.0005142
## alternative hypothesis: clustered
##
## Quadrats: 25 tiles (irregular windows)
# Regular points from a Strauss process
set.seed(123)
p_regular <- rStrauss(beta = 2.9, gamma = 0.025, R = .5, W = disc10)
## Warning: Simulation will be performed in the containing rectangle and
## clipped to the original window.
plot(p_regular)
# Run a quadrat test
quadrat.test(p_regular, alternative = "regular")
## Warning: Some expected counts are small; chi^2 approximation may be
## inaccurate
##
## Chi-squared test of CSR using quadrat counts
## Pearson X2 statistic
##
## data: p_regular
## X2 = 8.57, df = 24, p-value = 0.001619
## alternative hypothesis: regular
##
## Quadrats: 25 tiles (irregular windows)
# Another way of assessing clustering and regularity is to consider each point, and how it relates to the other points
# One simple measure is the distribution of the distances from each point to its nearest neighbor
# The nndist() function in spatstat takes a point pattern and for each point returns the distance to its nearest neighbor
# Instead of working with the nearest-neighbor density, as seen in the histogram, it can be easier to work with the cumulative distribution function, G(r)
# This is the probability of a point having a nearest neighbour within a distance r
# For a uniform Poisson process, G can be computed theoretically, and is G(r) = 1 - exp( - lambda * pi * r ^ 2)
# You can compute G empirically from your data using Gest() and so compare with the theoretical value.
# Events near the edge of the window might have had a nearest neighbor outside the window, and so unobserved
# This will make the distance to its observed nearest neighbor larger than expected, biasing the estimate of G
# There are several methods for correcting this bias
# Plotting the output from Gest shows the theoretical cumulative distribution and several estimates of the cumulative distribution using different edge corrections
# Often these edge corrections are almost indistinguishable, and the lines overlap
# The plot can be used as a quick exploratory test of complete spatial randomness
# Two ppp objects, p_poisson, and p_regular are defined for you
# Point patterns are pre-defined
p_poisson <- ppois
p_poisson
## Planar point pattern: 469 points
## window: polygonal boundary
## enclosing rectangle: [-10, 10] x [-10, 10] units
p_regular
## Planar point pattern: 325 points
## window: polygonal boundary
## enclosing rectangle: [-10, 10] x [-10, 10] units
# Calc nearest-neighbor distances for Poisson point data
nnd_poisson <- nndist(p_poisson)
# Draw a histogram of nearest-neighbor distances
hist(nnd_poisson)
# Estimate G(r)
G_poisson <- Gest(p_poisson)
# Plot G(r) vs. r
plot(G_poisson)
# Repeat for regular point data
nnd_regular <- nndist(p_regular)
hist(nnd_regular)
G_regular <- Gest(p_regular)
plot(G_regular)
# A number of other functions of point patterns have been developed
# They are conventionally denoted by various capital letters, including F, H, J, K and L
# The K-function is defined as the expected number of points within a distance of a point of the process, scaled by the intensity
# Like G, this can be computed theoretically for a uniform Poisson process and is K(r) = pi * r ^ 2 - the area of a circle of that radius
# Deviation from pi * r ^ 2 can indicate clustering or point inhibition
# Computational estimates of K(r) are done using the Kest() function.
# As with G calculations, K-function calculations also need edge corrections
# The default edge correction in spatstat is generally the best, but can be slow, so we'll use the "border" correction for speed here
# Uncertainties on K-function estimates can be assessed by randomly sampling points from a uniform Poisson process in the area and computing the K-function of the simulated data
# Repeat this process 99 times, and take the minimum and maximum value of K over each of the distance values
# This gives an envelope - if the K-function from the data goes above the top of the envelope then we have evidence for clustering
# If the K-function goes below the envelope then there is evidence for an inhibitory process causing points to be spaced out
# Envelopes can be computed using the envelope() function
# The plot method for estimates of K uses a formula system where a dot on the left of a formula refers to K®
# So the default plot uses . ~ r
# You can compare the estimate of K to a Poisson process by plotting . - pi * r ^ 2 ~ r
# If the data was generated by a Poisson process, then the line should be close to zero for all values of r
# Point patterns are pre-defined
p_poisson
## Planar point pattern: 469 points
## window: polygonal boundary
## enclosing rectangle: [-10, 10] x [-10, 10] units
p_cluster
## Planar point pattern: 332 points
## window: polygonal boundary
## enclosing rectangle: [-10, 10] x [-10, 10] units
p_regular
## Planar point pattern: 325 points
## window: polygonal boundary
## enclosing rectangle: [-10, 10] x [-10, 10] units
# Estimate the K-function for the Poisson points
K_poisson <- Kest(p_poisson, correction = "border")
# The default plot shows quadratic growth
plot(K_poisson, . ~ r)
# Subtract pi * r ^ 2 from the Y-axis and plot
plot(K_poisson, . - pi * r**2 ~ r)
# Compute envelopes of K under random locations
K_cluster_env <- envelope(p_cluster, Kest, correction = "border")
## Generating 99 simulations of CSR ...
## 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38,
## 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76,
## 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99.
##
## Done.
# Insert the full formula to plot K minus pi * r^2
plot(K_cluster_env, . - pi * r^2 ~ r)
# Repeat for regular data
K_regular_env <- envelope(p_regular, Kest, correction = "border")
## Generating 99 simulations of CSR ...
## 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38,
## 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76,
## 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99.
##
## Done.
plot(K_regular_env, . - pi * r^2 ~ r)
Chapter 2 - Point Pattern Analysis
Bivariate point problems:
Spatial segregation:
Space-time data:
Space-time clustering:
Example code includes:
# The dataset we shall use for this example consists of crimes in a 4km radius of the center of Preston, a town in north-west England
# We want to look for hotspots of violent crime in the area
# A ppp object called preston_crime has been constructed
# This is a marked point process, where each point is marked as either a "Violent Crime" or a "Non-violent crime"
# The marks for each point can be retrieved using the marks() function
# The window is a 4km circle centered on the town center
# A map image of the town from OpenStreetMap has also been loaded, called preston_osm
preston_crime <- readRDS("./RInputFiles/pcrime-spatstat.RDS")
preston_osm <- readRDS("./RInputFiles/osm_preston_gray.RDS")
# Get some summary information on the dataset
summary(preston_crime)
## Marked planar point pattern: 2036 points
## Average intensity 4.053214e-05 points per square unit
##
## Coordinates are given to 2 decimal places
## i.e. rounded to the nearest multiple of 0.01 units
##
## Multitype:
## frequency proportion intensity
## Non-violent crime 1812 0.8899804 3.607281e-05
## Violent crime 224 0.1100196 4.459332e-06
##
## Window: polygonal boundary
## single connected closed polygon with 99 vertices
## enclosing rectangle: [349773, 357771] x [425706.5, 433705.5] units
## Window area = 50231700 square units
## Fraction of frame area: 0.785
# Get a table of marks
table(marks(preston_crime))
##
## Non-violent crime Violent crime
## 1812 224
# Define a function to create a map
preston_map <- function(cols = c("green","red"), cex = c(1, 1), pch = c(1, 1)) {
raster::plotRGB(preston_osm) # from the raster package
plot(preston_crime, cols = cols, pch = pch, cex = cex, add = TRUE, show.window = TRUE)
}
# Draw the map with colors, sizes and plot character
preston_map(
cols = c("black", "red"),
cex = c(0.5, 1),
pch = 19
)
# One method of computing a smooth intensity surface from a set of points is to use kernel smoothing
# Imagine replacing each point with a dot of ink on absorbent paper
# Each individual ink drop spreads out into a patch with a dark center, and multiple drops add together and make the paper even darker
# With the right amount of ink in each drop, and with paper of the right absorbency, you can create a fair impression of the density of the original points
# In kernel smoothing jargon, this means computing a bandwidth and using a particular kernel function
# To get a smooth map of violent crimes proportion, we can estimate the intensity surface for violent and non-violent crimes, and take the ratio
# To do this with the density() function in spatstat, we have to split the points according to the two values of the marks and then compute the ratio of the violent crime surface to the total
# The function has sensible defaults for the kernel function and bandwidth to guarantee something that looks at least plausible
# preston_crime has been pre-defined
preston_crime
## Marked planar point pattern: 2036 points
## Multitype, with levels = Non-violent crime, Violent crime
## window: polygonal boundary
## enclosing rectangle: [349773, 357771] x [425706.5, 433705.5] units
# Use the split function to show the two point patterns
crime_splits <- split(preston_crime)
# Plot the split crime
plot(crime_splits)
# Compute the densities of both sets of points
crime_densities <- density(crime_splits)
# Calc the violent density divided by the sum of both
frac_violent_crime_density <- crime_densities[[2]] /
(crime_densities[[1]] + crime_densities[[2]])
# Plot the density of the fraction of violent crime
plot(frac_violent_crime_density)
# We can get a more principled measure of the violent crime ratio using a spatial segregation model
# The spatialkernel package implements the theory of spatial segregation
# The first step is to compute the optimal bandwidth for kernel smoothing under the segregation model
# A small bandwidth would result in a density that is mostly zero, with spikes at the event locations
# A large bandwidth would flatten out any structure in the events, resulting in a large "blob" across the whole window
# Somewhere between these extremes is a bandwidth that best represents an underlying density for the process
# spseg() will scan over a range of bandwidths and compute a test statistic using a cross-validation method
# The bandwidth that maximizes this test statistic is the one to use
# The returned value from spseg() in this case is a list, with h and cv elements giving the values of the statistic over the input h values
# The spatialkernel package supplies a plotcv function to show how the test value varies
# The hcv element has the value of the best bandwidth
# spatstat is loaded and the preston_crime object is read in
# Scan from 500m to 1000m in steps of 50m
bw_choice <- spatialkernel::spseg(
preston_crime,
h = seq(500, 1000, by = 50),
opt = 1)
##
## Calculating cross-validated likelihood function
# Plot the results and highlight the best bandwidth
spatialkernel::plotcv(bw_choice)
abline(v = bw_choice$hcv, lty = 2, col = "red")
# Print the best bandwidth
print(bw_choice$hcv)
## [1] 800
# The second step is to compute the probabilities for violent and non-violent crimes as a smooth surface, as well as the p-values for a point-wise test of segregation
# This is done by calling spseg() with opt = 3 and a fixed bandwidth parameter h
# Normally you would run this process for at least 100 simulations, but that will take too long to run here
# Instead, run for only 10 simulations
# Then you can use a pre-loaded object seg which is the output from a 1000 simulation run that took about 20 minutes to complete
# Set the correct bandwidth and run for 10 simulations only
seg10 <- spatialkernel::spseg(
pts = preston_crime,
h = bw_choice$hcv,
opt = 3,
ntest = 10,
proc = FALSE)
# Plot the segregation map for violent crime
spatialkernel::plotmc(seg10, "Violent crime")
# Plot seg, the result of running 1000 simulations (not included here)
# spatialkernel::plotmc(seg, "Violent crime")
# With a base map and some image and contour functions we can display both the probabilities and the significance tests over the area with more control than the plotmc() function.
# The seg object is a list with several components
# The X and Y coordinates of the grid are stored in the $gridx and $gridy elements
# The probabilities of each class of data (violent or non-violent crime) are in a matrix element $p with a column for each class
# The p-value of the significance test is in a similar matrix element called $stpvalue
# Rearranging columns of these matrices into a grid of values can be done with R's matrix() function
# From there you can construct list objects with a vector $x of X-coordinates, $y of Y-coordinates, and $z as the matrix
# You can then feed this to image() or contour() for visualization
# This process may seem complex, but remember that with R you can always write functions to perform complex tasks and those you may repeat often
# For example, to help with the mapping in this exercise you will create a function that builds a map from four different items
# The seg object from 1000 simulations is loaded, as well as the preston_crime points and the preston_osm map image
# Inspect the structure of the spatial segregation object
# str(seg)
# Get the number of columns in the data so we can rearrange to a grid
# ncol <- length(seg$gridx)
# Rearrange the probability column into a grid
# prob_violent <- list(x = seg$gridx,
# y = seg$gridy,
# z = matrix(seg$p[, "Violent crime"],
# ncol = ncol))
# image(prob_violent)
# Rearrange the p-values, but choose a p-value threshold
# p_value <- list(x = seg$gridx,
# y = seg$gridy,
# z = matrix(seg$stpvalue[, "Violent crime"] < 0.05,
# ncol = ncol))
# image(p_value)
# Create a mapping function
# segmap <- function(prob_list, pv_list, low, high){
#
# # background map
# plotRGB(preston_osm)
#
# # p-value areas
# image(pv_list,
# col = c("#00000000", "#FF808080"), add = TRUE)
#
# # probability contours
# contour(prob_list,
# levels = c(low, high),
# col = c("#206020", "red"),
# labels = c("Low", "High"),
# add = TRUE)
#
# # boundary window
# plot(Window(preston_crime), add = TRUE)
# }
#
# # Map the probability and p-value
# segmap(prob_violent, p_value, 0.05, 0.15)
# The sasquatch, or "bigfoot", is a large ape-like creature reported to live in North American forests
# The Bigfoot Field Researchers Organization maintains a database of sightings and allows its use for teaching and research
# A cleaned subset of data in north-west USA has been created as the ppp object sasq and is loaded for you to explore the space-time pattern of sightings in the area
# Get a quick summary of the dataset
sasq <- readRDS("./RInputFiles/sasquatch.RDS")
summary(sasq)
## Marked planar point pattern: 423 points
## Average intensity 2.097156e-09 points per square unit
##
## *Pattern contains duplicated points*
##
## Coordinates are given to 1 decimal place
## i.e. rounded to the nearest multiple of 0.1 units
##
## Mark variables: date, year, month
## Summary:
## date year month
## Min. :1990-05-03 Y2004 : 41 Sep : 59
## 1st Qu.:2000-04-30 Y2000 : 36 Oct : 56
## Median :2003-11-05 Y2002 : 30 Aug : 54
## Mean :2003-08-11 Y2005 : 30 Jul : 50
## 3rd Qu.:2007-11-02 Y2001 : 26 Nov : 43
## Max. :2016-04-05 Y2008 : 26 Jun : 41
## (Other):234 (Other):120
##
## Window: polygonal boundary
## single connected closed polygon with 64 vertices
## enclosing rectangle: [368187.8, 764535.6] x [4644873, 5434933] units
## Window area = 2.01702e+11 square units
## Fraction of frame area: 0.644
# Plot unmarked points
plot(unmark(sasq))
# Plot the points using a circle sized by date
plot(sasq, which.marks = "date")
# Show the available marks
names(marks(sasq))
## [1] "date" "year" "month"
# Histogram the dates of the sightings, grouped by year
hist(marks(sasq)$date, "years", freq = TRUE)
# Plot and tabulate the calendar month of all the sightings
plot(table(marks(sasq)$month))
# Split on the month mark
sasq_by_month <- split(sasq, "month", un = TRUE)
# Plot monthly maps
plot(sasq_by_month)
# Plot smoothed versions of the above split maps
plot(density(sasq_by_month))
# To do a space-time clustering test with stmctest() from the splancs package, you first need to convert parts of your ppp object
# Functions in splancs tend to use matrix data instead of data frames.
# To run stmctest() you need to set up
# event locations
# event times
# region polygon
# time limits
# the time and space ranges for analysis
# The sasq object is loaded and the spatstat and splancs packages are ready for use
# Get a matrix of event coordinates
sasq_xy <- as.matrix(spatstat::coords(sasq))
# Check the matrix has two columns
dim(sasq_xy)
## [1] 423 2
# Get a vector of event times
sasq_t <- marks(sasq)$date
# Extract a two-column matrix from the ppp object
sasq_poly <- as.matrix(as.data.frame(Window(sasq)))
dim(sasq_poly)
## [1] 64 2
# Set the time limit to 1 day before and 1 day after the range of times
tlimits <- range(sasq_t) + c(-1, 1)
# Scan over 400m intervals from 100m to 20km
s <- seq(100, 20000, by = 400)
# Scan over 14 day intervals from one week to 31 weeks
tm <- seq(7, 217, by = 14)
# Everything is now ready for you to run the space-time clustering test function
# You can then plot the results and compute a p-value for rejecting the null hypothesis of no space-time clustering
# Any space-time clustering in a data set will be removed if you randomly rearrange the dates of the data points
# The stmctest() function computes a clustering test statistic for your data based on the space-time K-function - how many points are within a spatial and temporal window of a point of the data
# It then does a number of random rearrangements of the dates among the points and computes the clustering statistic
# After doing this a large number of times, you can compare the test statistic for your data with the values from the random data
# If the test statistic for your data is sufficiently large or small, you can reject the null hypothesis of no space-time clustering
# The output from stmctest() is a list with a single t0 which is the test statistic for your data, and a vector of t from the simulations
# By converting to data frame you can feed this to ggplot functions
# Because the window area is a large number of square meters, and we have about 400 events, the numerical value of the intensity is a very small number
# This makes values of the various K-functions very large numbers, since they are proportional to the inverse of the intensity
# Don't worry if you see 10^10 or higher
# The p-value of a Monte-Carlo test like this is just the proportion of test statistics that are larger than the value from the data
# You can compute this from the t and t0 elements of the output
# All the objects from the previous exercise are loaded.
# Run 999 simulations
sasq_mc <- splancs::stmctest(sasq_xy, sasq_t, sasq_poly, tlimits, s, tm, nsim = 999, quiet = TRUE)
names(sasq_mc)
## [1] "t0" "t"
# Histogram the simulated statistics and add a line at the data value
ggplot(data.frame(sasq_mc), aes(x = t)) +
geom_histogram(binwidth = 1e13) +
geom_vline(aes(xintercept = t0))
# Compute the p-value as the proportion of tests greater than the data
sum(sasq_mc$t > sasq_mc$t0) / 1000
## [1] 0.04
Chapter 3 - Areal Statistics
Areal statistics:
Spatial health data:
Generalized linear models in space:
Correlation in spatial GLM:
Example code includes:
library(cartogram)
library(rgeos)
## rgeos version: 0.3-26, (SVN revision 560)
## GEOS runtime version: 3.6.1-CAPI-1.10.1 r0
## Linking to sp version: 1.2-7
## Polygon checking: TRUE
library(spdep)
## Loading required package: sp
## Loading required package: Matrix
## Loading required package: spData
## To access larger datasets in this package, install the spDataLarge
## package with: `install.packages('spDataLarge')`
##
## Attaching package: 'spData'
## The following objects are masked _by_ '.GlobalEnv':
##
## x, y
library(epitools)
library(R2BayesX)
## Loading required package: BayesXsrc
## Loading required package: colorspace
##
## Attaching package: 'colorspace'
## The following object is masked from 'package:spatstat':
##
## coords
## Loading required package: mgcv
## This is mgcv 1.8-17. For overview type 'help("mgcv-package")'.
# In 2016 the UK held a public vote on whether to remain in the European Union
# The results of the referendum, where people voted either "Remain" or "Leave", are available online
# The data set london_ref contains the results for the 32 boroughs of London, and includes the number and percentage of votes in each category as well as the count of spoilt votes, the population size and the electorate size
# The london_ref object is a SpatialPolygonsDataFrame, a special kind of data frame where each row also has the shape of the borough
# It behaves like a data frame in many respects, but can also be used to plot a choropleth, or shaded polygon, map
# You should start with some simple data exploration and mapping. The following variables will be useful:
# NAME : the name of the borough.
# Electorate : the total number of people who can vote.
# Remain, Leave : the number of votes for "Remain" or "Leave".
# Pct_Remain, Pct_Leave : the percentage of votes for each sid
# spplot() from the raster package provides a convenient way to draw a shaded map of regions
# See what information we have for each borough
london_ref <- readRDS("./RInputFiles/london_eu.RDS")
summary(london_ref)
## Object of class SpatialPolygonsDataFrame
## Coordinates:
## min max
## x 503574.2 561956.7
## y 155850.8 200933.6
## Is projected: TRUE
## proj4string :
## [+proj=tmerc +lat_0=49 +lon_0=-2 +k=0.9996012717 +x_0=400000
## +y_0=-100000 +datum=OSGB36 +units=m +no_defs +ellps=airy
## +towgs84=446.448,-125.157,542.060,0.1502,0.2470,0.8421,-20.4894]
## Data attributes:
## NAME TOTAL_POP Electorate Votes_Cast
## Length:32 Min. :157711 Min. : 83042 Min. : 54801
## Class :character 1st Qu.:237717 1st Qu.:143458 1st Qu.:104079
## Mode :character Median :272017 Median :168394 Median :116280
## Mean :270780 Mean :169337 Mean :118025
## 3rd Qu.:316911 3rd Qu.:196285 3rd Qu.:134142
## Max. :379691 Max. :245349 Max. :182570
## Remain Leave Rejected_Ballots Pct_Remain
## Min. : 27750 Min. :17138 Min. : 60.0 Min. :30.34
## 1st Qu.: 55973 1st Qu.:32138 1st Qu.:105.0 1st Qu.:53.69
## Median : 70254 Median :45263 Median :138.0 Median :61.01
## Mean : 70631 Mean :47255 Mean :139.0 Mean :60.46
## 3rd Qu.: 84287 3rd Qu.:59018 3rd Qu.:164.2 3rd Qu.:69.90
## Max. :118463 Max. :96885 Max. :267.0 Max. :78.62
## Pct_Leave Pct_Rejected Assembly
## Min. :21.38 Min. :0.0600 Length:32
## 1st Qu.:30.10 1st Qu.:0.0875 Class :character
## Median :38.99 Median :0.1100 Mode :character
## Mean :39.54 Mean :0.1187
## 3rd Qu.:46.31 3rd Qu.:0.1500
## Max. :69.66 Max. :0.2200
# Which boroughs voted to "Leave"?
london_ref$NAME[london_ref$Leave > london_ref$Remain]
## [1] "Sutton" "Barking and Dagenham" "Bexley"
## [4] "Havering" "Hillingdon"
# Plot a map of the percentage that voted "Remain"
sp::spplot(london_ref, zcol = "Pct_Remain")
# Large areas, such as cities or countries, are often divided into smaller administrative units, often into zones of approximately equal population
# But the area of those units may vary considerably
# When mapping them, the large areas carry more visual "weight" than small areas, although just as many people live in the small areas.
# One technique for correcting for this is the cartogram
# This is a controlled distortion of the regions, expanding some and contracting others, so that the area of each region is proportional to a desired quantity, such as the population
# The cartogram also tries to maintain the correct geography as much as possible, by keeping regions in roughly the same place relative to each other
# The cartogram package contains functions for creating cartograms
# You give it a spatial data frame and the name of a column, and you get back a similar data frame but with regions distorted so that the region area is proportional to the column value of the regions
# You'll also use the rgeos package for computing the areas of individual regions with the gArea() function
# Use the cartogram and rgeos packages (called at top of routine)
# library(cartogram)
# library(rgeos)
# Make a scatterplot of electorate vs borough area
names(london_ref)
## [1] "NAME" "TOTAL_POP" "Electorate"
## [4] "Votes_Cast" "Remain" "Leave"
## [7] "Rejected_Ballots" "Pct_Remain" "Pct_Leave"
## [10] "Pct_Rejected" "Assembly"
plot(london_ref$Electorate, gArea(london_ref, byid = TRUE))
# Make a cartogram, scaling the area to the electorate
carto_ref <- cartogram(london_ref, "Electorate")
## Mean size error for iteration 1: 1.5881743190908
## Mean size error for iteration 2: 1.32100446455657
## Mean size error for iteration 3: 1.18227723476121
## Mean size error for iteration 4: 1.10676057030171
## Mean size error for iteration 5: 1.0657703107641
## Mean size error for iteration 6: 1.04259259672006
## Mean size error for iteration 7: 1.02832326230708
## Mean size error for iteration 8: 1.01931941526112
## Mean size error for iteration 9: 1.01341424685212
## Mean size error for iteration 10: 1.00941370606418
## Mean size error for iteration 11: 1.00663364742297
## Mean size error for iteration 12: 1.00470553629914
## Mean size error for iteration 13: 1.00336434720465
## Mean size error for iteration 14: 1.00241457265516
## Mean size error for iteration 15: 1.00174179254187
plot(carto_ref)
# Check the linearity of the electorate-area plot
plot(carto_ref$Electorate, gArea(carto_ref, byid = TRUE))
# Make a fairer map of the Remain percentage
sp::spplot(carto_ref, "Pct_Remain")
# The map of "Remain" votes seems to have spatial correlation
# Pick any two boroughs that are neighbors - with a shared border - and the chances are they'll be more similar than any two random boroughs
# This can be a problem when using statistical models that assume, conditional on the model, that the data points are independent
# The spdep package has functions for measures of spatial correlation, also known as spatial dependency
# Computing these measures first requires you to work out which regions are neighbors via the poly2nb() function, short for "polygons to neighbors"
# The result is an object of class nb
# Then you can compute the test statistic and run a significance test on the null hypothesis of no spatial correlation
# The significance test can either be done by Monte-Carlo or theoretical models
# In this example you'll use the Moran "I" statistic to test the spatial correlation of the population and the percentage "Remain" vote.
# The london_ref spatial data object is loaded for you
# Use the spdep package (called at top of routine)
# library(spdep)
# Make neighbor list
borough_nb <- poly2nb(london_ref)
# Get center points of each borough
borough_centers <- coordinates(london_ref)
# Show the connections
plot(london_ref)
plot(borough_nb, borough_centers, add = TRUE)
# Map the total pop'n
sp::spplot(london_ref, zcol = "TOTAL_POP")
# Run a Moran I test on total pop'n
moran.test(
london_ref$TOTAL_POP,
nb2listw(borough_nb)
)
##
## Moran I test under randomisation
##
## data: london_ref$TOTAL_POP
## weights: nb2listw(borough_nb)
##
## Moran I statistic standard deviate = 1.2124, p-value = 0.1127
## alternative hypothesis: greater
## sample estimates:
## Moran I statistic Expectation Variance
## 0.11549264 -0.03225806 0.01485190
# Map % Remain
sp::spplot(london_ref, zcol = "Pct_Remain")
# Run a Moran I MC test on % Remain
moran.mc(
london_ref$Pct_Remain,
nb2listw(borough_nb),
nsim = 999
)
##
## Monte-Carlo simulation of Moran I
##
## data: london_ref$Pct_Remain
## weights: nb2listw(borough_nb)
## number of simulations + 1: 1000
##
## statistic = 0.42841, observed rank = 1000, p-value = 0.001
## alternative hypothesis: greater
# The UK's National Health Service publishes weekly data for consultations at a number of "sentinel" clinics and makes this data available
# A dataset for one week in February 2017 has been loaded for you to analyze
# It is called london, and contains data for the 32 boroughs.
# You will focus on reports of "Influenza-like illness", or more simply "Flu"
# Your first task is to map the "Standardized Morbidity Ratio", or SMR
# This is the number of cases per person, but scaled by the overall incidence so that the expected number is 1
# The london object, a spatial data frame, and the sp package are ready for you
# Get a summary of the data set
london <- readRDS("./RInputFiles/london_2017_2.RDS")
summary(london)
## Object of class SpatialPolygonsDataFrame
## Coordinates:
## min max
## x 503574.2 561956.7
## y 155850.8 200933.6
## Is projected: TRUE
## proj4string :
## [+proj=tmerc +lat_0=49 +lon_0=-2 +k=0.9996012717 +x_0=400000
## +y_0=-100000 +datum=OSGB36 +units=m +no_defs +ellps=airy
## +towgs84=446.448,-125.157,542.060,0.1502,0.2470,0.8421,-20.4894]
## Data attributes:
## CODE NAME Flu_OBS Vom_OBS
## Length:32 Length:32 Min. : 0.00 Min. : 0.00
## Class :character Class :character 1st Qu.: 11.00 1st Qu.:14.00
## Mode :character Mode :character Median : 33.00 Median :40.00
## Mean : 38.56 Mean :37.28
## 3rd Qu.: 61.00 3rd Qu.:57.50
## Max. :112.00 Max. :96.00
## Diarr_OBS Gastro_OBS TOTAL_POP CCGcode
## Min. : 0.00 Min. : 0.0 Min. :157711 Length:32
## 1st Qu.: 22.50 1st Qu.: 48.0 1st Qu.:237717 Class :character
## Median : 62.00 Median :120.5 Median :272017 Mode :character
## Mean : 57.03 Mean :113.7 Mean :270780
## 3rd Qu.: 90.75 3rd Qu.:176.8 3rd Qu.:316911
## Max. :122.00 Max. :251.0 Max. :379691
## CCG.geography.code CCG.name Asthma_Prevalence
## Length:32 Length:32 Min. :3.550
## Class :character Class :character 1st Qu.:4.412
## Mode :character Mode :character Median :4.660
## Mean :4.624
## 3rd Qu.:4.925
## Max. :5.470
## Obesity_Prevalence Cancer_Prevalence Diabetes_Prevalence Income
## Min. : 3.930 Min. :0.870 Min. :3.620 Min. :0.0730
## 1st Qu.: 5.855 1st Qu.:1.438 1st Qu.:5.265 1st Qu.:0.1308
## Median : 7.565 Median :1.605 Median :6.305 Median :0.1665
## Mean : 7.585 Mean :1.684 Mean :6.245 Mean :0.1655
## 3rd Qu.: 8.810 3rd Qu.:1.903 3rd Qu.:7.067 3rd Qu.:0.1985
## Max. :12.130 Max. :2.540 Max. :9.060 Max. :0.2530
## Employment Education HealthDeprivation Crime
## Min. :0.0570 Min. : 3.958 Min. :-1.4100 Min. :-0.1550
## 1st Qu.:0.0905 1st Qu.:10.047 1st Qu.:-0.5055 1st Qu.: 0.3745
## Median :0.1095 Median :13.925 Median :-0.2050 Median : 0.5515
## Mean :0.1092 Mean :14.024 Mean :-0.2044 Mean : 0.5379
## 3rd Qu.:0.1283 3rd Qu.:17.480 3rd Qu.: 0.2010 3rd Qu.: 0.7823
## Max. :0.1560 Max. :27.182 Max. : 0.5430 Max. : 1.0190
## Services Environment i
## Min. :19.63 Min. :13.37 Min. : 0.00
## 1st Qu.:24.43 1st Qu.:24.03 1st Qu.: 7.75
## Median :30.41 Median :28.20 Median :15.50
## Mean :29.55 Mean :31.38 Mean :15.50
## 3rd Qu.:34.74 3rd Qu.:40.15 3rd Qu.:23.25
## Max. :41.89 Max. :55.00 Max. :31.00
# Map the OBServed number of flu reports
sp::spplot(london, "Flu_OBS")
# Compute and print the overall incidence of flu
r <- sum(london$Flu_OBS) / sum(london$TOTAL_POP)
r
## [1] 0.0001424128
# Calculate the expected number for each borough
london$Flu_EXP <- london$TOTAL_POP * r
# Calculate the ratio of OBServed to EXPected
london$Flu_SMR <- london$Flu_OBS / london$Flu_EXP
# Map the SMR
sp::spplot(london, "Flu_SMR")
# SMRs above 1 represent high rates of disease - but how high does an SMR need to be before it can be considered statistically significant?
# Given a number of cases and a population, its possible to work out confidence intervals at some level of the estimate of the ratio of cases per population using the properties of the binomial distribution
# The epitools package has a function binom.exact() which you can use to compute confidence intervals for the flu data
# These can be scaled to be confidence intervals on the SMR by dividing by the overall rate
# The london data set and the sp package are loaded
# For the binomial statistics function (called at top of routine)
# library(epitools)
# Get CI from binomial distribution
flu_ci <- binom.exact(london$Flu_OBS, london$TOTAL_POP)
# Add borough names
flu_ci$NAME <- london$NAME
# Calculate London rate, then compute SMR
r <- sum(london$Flu_OBS) / sum(london$TOTAL_POP)
flu_ci$SMR <- flu_ci$proportion / r
# Subset the high SMR data
flu_high <- flu_ci[flu_ci$SMR > 1, ]
# Plot estimates with CIs
ggplot(flu_high, aes(x = NAME, y = proportion / r, ymin = lower / r, ymax = upper / r)) +
geom_pointrange() +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
# Distributions and confidence intervals can be difficult things to present to non-statisticians
# An alternative is to present a probability that a value is over a threshold
# For example, public health teams might be interested in when an SMR has more than doubled, and as a statistician you can give a probability that this has happened
# Then the public health team might decide to go to some alert level when the probability of a doubling of SMR is over 0.95
# Again, the properties of the binomial distribution let you compute this for proportional data
# You can then map these exceedence probabilities for some threshold, and use a sensible color scheme to highlight probabilities close to 1
# The london data set has been loaded, and the expected flu case count, Flu_EXP has been computed
# Probability of a binomial exceeding a multiple
binom.exceed <- function(observed, population, expected, e){
1 - pbinom(e * expected, population, prob = observed / population)
}
# Compute P(rate > 2)
london$Flu_gt_2 <- binom.exceed(
observed = london$Flu_OBS,
population = london$TOTAL_POP,
expected = london$Flu_EXP,
e = 2)
# Use a 50-color palette that only starts changing at around 0.9
pal <- c(
rep("#B0D0B0", 40),
colorRampPalette(c("#B0D0B0", "orange"))(5),
colorRampPalette(c("orange", "red"))(5)
)
# Plot the P(rate > 2) map
sp::spplot(london, "Flu_gt_2", col.regions = pal, at = seq(0, 1, len = 50))
# A Poisson generalized linear model is a way of fitting count data to explanatory variables
# You get out parameter estimates and standard errors for your explanatory variables, and can get fitted values and residuals
# The glm() function fits Poisson GLMs. It works just like the lm() function, but you also specify a family argument
# The formula has the usual meaning - response on the left of the ~, and explanatory variables on the right
# To cope with count data coming from populations of different sizes, you specify an offset argument
# This adds a constant term for each row of the data in the model. The log of the population is used in the offset term
# The london health data set has been loaded with the sp package also ready.
# Run a Poisson generalized linear model of how the count of flu cases, Flu_OBS, depends on the Health Deprivation index value, HealthDeprivation
# The first argument is the formula (response vairable on the left)
# The family argument is a function, poisson
# Fit a poisson GLM.
model_flu <- glm(
Flu_OBS ~ HealthDeprivation,
offset = log(TOTAL_POP),
data = london,
family = "poisson")
# Is HealthDeprivation significant?
summary(model_flu)
##
## Call:
## glm(formula = Flu_OBS ~ HealthDeprivation, family = "poisson",
## data = london, offset = log(TOTAL_POP))
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -9.5361 -4.5285 -0.0499 2.9043 8.2194
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -8.78190 0.02869 -306.043 <2e-16 ***
## HealthDeprivation 0.65689 0.06797 9.665 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for poisson family taken to be 1)
##
## Null deviance: 703.75 on 31 degrees of freedom
## Residual deviance: 605.03 on 30 degrees of freedom
## AIC: 762.37
##
## Number of Fisher Scoring iterations: 5
# Put residuals into the spatial data.
london$Flu_Resid <- residuals(model_flu)
# Map the residuals using spplot
sp::spplot(london, "Flu_Resid")
# A linear model should fit the data and leave uncorrelated residuals
# This applies to non-spatial models, where, for example, fitting a straight line through points on a curve would lead to serially-correlated residuals
# A model on spatial data should aim to have residuals that show no significant spatial correlation
# You can test the model fitted to the flu data using moran.mc() from the spdep package
# Monte Carlo Moran tests were previously discussed in the Spatial autocorrelation test exercise earlier in the chapter
# Compute the neighborhood structure.
borough_nb <- poly2nb(london)
# Test spatial correlation of the residuals.
moran.mc(london$Flu_Resid, listw = nb2listw(borough_nb), nsim = 999)
##
## Monte-Carlo simulation of Moran I
##
## data: london$Flu_Resid
## weights: nb2listw(borough_nb)
## number of simulations + 1: 1000
##
## statistic = 0.15059, observed rank = 925, p-value = 0.075
## alternative hypothesis: greater
# Bayesian statistical models return samples of the parameters of interest (the "posterior" distribution) based on some "prior" distribution which is then updated by the data
# The Bayesian modelling process returns a number of samples from which you can compute the mean, or an exceedence probability, or any other quantity you might compute from a distribution
# Before you fit a model with spatial correlation, you'll first fit the same model as before, but using Bayesian inference
# The london data set has been loaded
# The R2BayesX package provides an interface to the BayesX code.
# The syntax for bayesx() is similar, but the offset has to be specified explicitly from the data frame, the family name is in quotes, and the spatial data frame needs to be turned into a plain data frame
# Run the model fitting and inspect with summary()
# Plot the samples from the Bayesian model
# On the left is the "trace" of samples in sequential order, and on the right is the parameter density
# For this model there is an intercept and a slope for the Health Deprivation score
# The parameter density should correspond with the parameter summary
# Use R2BayesX (called at top of routine)
# library(R2BayesX)
# Fit a GLM
model_flu <- glm(Flu_OBS ~ HealthDeprivation, offset = log(TOTAL_POP),
data = london, family = poisson)
# Summarize it
summary(model_flu)
##
## Call:
## glm(formula = Flu_OBS ~ HealthDeprivation, family = poisson,
## data = london, offset = log(TOTAL_POP))
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -9.5361 -4.5285 -0.0499 2.9043 8.2194
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -8.78190 0.02869 -306.043 <2e-16 ***
## HealthDeprivation 0.65689 0.06797 9.665 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for poisson family taken to be 1)
##
## Null deviance: 703.75 on 31 degrees of freedom
## Residual deviance: 605.03 on 30 degrees of freedom
## AIC: 762.37
##
## Number of Fisher Scoring iterations: 5
# Calculate coeff confidence intervals
confint(model_flu)
## Waiting for profiling to be done...
## 2.5 % 97.5 %
## (Intercept) -8.838677 -8.7261843
## HealthDeprivation 0.524437 0.7908841
# Fit a Bayesian GLM
bayes_flu <- bayesx(Flu_OBS ~ HealthDeprivation, offset = log(london$TOTAL_POP),
family = "poisson", data = as.data.frame(london),
control = bayesx.control(seed = 17610407))
# Summarize it
summary(bayes_flu)
## Call:
## bayesx(formula = Flu_OBS ~ HealthDeprivation, data = as.data.frame(london),
## offset = log(london$TOTAL_POP), control = bayesx.control(seed = 17610407),
## family = "poisson")
##
## Fixed effects estimation results:
##
## Parametric coefficients:
## Mean Sd 2.5% 50% 97.5%
## (Intercept) -8.7831 0.0278 -8.8371 -8.7841 -8.7263
## HealthDeprivation 0.6592 0.0659 0.5345 0.6587 0.7900
##
## N = 32 burnin = 2000 method = MCMC family = poisson
## iterations = 12000 step = 10
# Look at the samples from the Bayesian model
plot(samples(bayes_flu))
# You've fitted a non-spatial GLM with BayesX
# You can include a spatially correlated term based on the adjacency structure by adding a term to the formula specifying a spatially correlated model
# Use poly2nb() to compute the neighborhood structure of london to an nb object
# R2BayesX uses its own objects for the adjacency. Convert the nb object to a gra object
# The sx function specifies additional terms to bayesx. Create a term using the "spatial" basis and the gra object for the boroughs to define the map
# Print a summary of the model object. You should see a table of coefficients for the parametric part of the model as in the previous exercise, and then a table of "Smooth terms variance" with one row for the spatial term
# Note that since BayesX can fit many different forms in its sx terms, most of which, like the spatial model here, cannot be simply expressed with a parameter or two
# This table shows the variance of the random effects - for further explanation consult a text on random effects modelling
# Compute adjacency objects
borough_nb <- poly2nb(london)
borough_gra <- nb2gra(borough_nb)
# Fit spatial model
flu_spatial <- bayesx(
Flu_OBS ~ HealthDeprivation + sx(i, bs = "spatial", map = borough_gra),
offset = log(london$TOTAL_POP),
family = "poisson", data = data.frame(london),
control = bayesx.control(seed = 17610407)
)
## Note: created new output directory 'C:/Users/Dave/AppData/Local/Temp/Rtmpa43JUL/bayesx1'!
# Summarize the model
summary(flu_spatial)
## Call:
## bayesx(formula = Flu_OBS ~ HealthDeprivation + sx(i, bs = "spatial",
## map = borough_gra), data = data.frame(london), offset = log(london$TOTAL_POP),
## control = bayesx.control(seed = 17610407), family = "poisson")
##
## Fixed effects estimation results:
##
## Parametric coefficients:
## Mean Sd 2.5% 50% 97.5%
## (Intercept) -9.2311 0.1246 -9.4826 -9.2298 -9.0148
## HealthDeprivation 0.7683 0.5844 -0.3749 0.7775 1.7922
##
## Smooth terms variances:
## Mean Sd 2.5% 50% 97.5% Min Max
## sx(i):mrf 4.6381 1.6822 2.2851 4.3510 8.8104 1.6557 16.266
##
## N = 32 burnin = 2000 method = MCMC family = poisson
## iterations = 12000 step = 10
# As with glm, you can get the fitted values and residuals from your model using the fitted and residuals functions. bayesx models are a bit more complex, since you have the linear predictor and terms from sx elements, such as the spatially correlated term
# The summary function will show you information for the linear model terms and the smoothing terms in two separate tables
# The spatial term is called "sx(i):mrf" - standing for "Markov Random Field"
# Bayesian analysis returns samples from a distribution for our S(x) term at each of the London boroughs
# The fitted function from bayesx models returns summary statistics for each borough
# You'll just look at the mean of that distribution for now
# The model from the BayesX output is available as flu_spatial.
# Summarise the model
summary(flu_spatial)
## Call:
## bayesx(formula = Flu_OBS ~ HealthDeprivation + sx(i, bs = "spatial",
## map = borough_gra), data = data.frame(london), offset = log(london$TOTAL_POP),
## control = bayesx.control(seed = 17610407), family = "poisson")
##
## Fixed effects estimation results:
##
## Parametric coefficients:
## Mean Sd 2.5% 50% 97.5%
## (Intercept) -9.2311 0.1246 -9.4826 -9.2298 -9.0148
## HealthDeprivation 0.7683 0.5844 -0.3749 0.7775 1.7922
##
## Smooth terms variances:
## Mean Sd 2.5% 50% 97.5% Min Max
## sx(i):mrf 4.6381 1.6822 2.2851 4.3510 8.8104 1.6557 16.266
##
## N = 32 burnin = 2000 method = MCMC family = poisson
## iterations = 12000 step = 10
# Map the fitted spatial term only
london$spatial <- fitted(flu_spatial, term = "sx(i):mrf")[, "Mean"]
sp::spplot(london, zcol = "spatial")
# Map the residuals
london$spatial_resid <- residuals(flu_spatial)[, "mu"]
sp::spplot(london, zcol = "spatial_resid")
# Test residuals for spatial correlation
moran.mc(london$spatial_resid, nb2listw(borough_nb), 999)
##
## Monte-Carlo simulation of Moran I
##
## data: london$spatial_resid
## weights: nb2listw(borough_nb)
## number of simulations + 1: 1000
##
## statistic = -0.26922, observed rank = 16, p-value = 0.984
## alternative hypothesis: greater
Chapter 4 - Geostatistics
Geostatistical data:
Variogram:
Kriging predictions:
Automatic kriging:
Wrap up:
Example code includes:
# Your job is to study the acidity (pH) of some Canadian survey data. The survey measurements are loaded into a spatial data object called ca_geo
# ca_geo has been pre-defined
ca_geo <- readRDS("./RInputFiles/ca_geo.RDS")
summary(ca_geo)
## Object of class SpatialPointsDataFrame
## Coordinates:
## min max
## x 542608.7 714269.2
## y 5541290.4 5652558.9
## Is projected: TRUE
## proj4string :
## [+init=epsg:32609 +proj=utm +zone=9 +datum=WGS84 +units=m +no_defs
## +ellps=WGS84 +towgs84=0,0,0]
## Number of points: 1140
## Data attributes:
## ID Elev pH Zn
## 102I881003: 1 Min. : 5.0 Min. :3.900 Min. : 1.00
## 102I881004: 1 1st Qu.: 20.0 1st Qu.:6.100 1st Qu.: 40.00
## 102I881005: 1 Median :110.0 Median :6.600 Median : 62.00
## 102I881006: 1 Mean :183.6 Mean :6.531 Mean : 72.34
## 102I881007: 1 3rd Qu.:310.0 3rd Qu.:7.000 3rd Qu.: 88.00
## 102I881008: 1 Max. :914.0 Max. :8.700 Max. :510.00
## (Other) :1134 NA's :9 NA's :33
## Cu Pb Ni Co
## Min. : 1.00 Min. : 1.000 Min. : 1.00 Min. : 1.00
## 1st Qu.: 21.00 1st Qu.: 1.000 1st Qu.: 7.00 1st Qu.:11.00
## Median : 37.00 Median : 1.000 Median : 20.00 Median :19.00
## Mean : 57.45 Mean : 2.975 Mean : 27.85 Mean :20.16
## 3rd Qu.: 76.00 3rd Qu.: 3.000 3rd Qu.: 37.00 3rd Qu.:27.00
## Max. :2950.00 Max. :195.000 Max. :340.00 Max. :77.00
##
## Ag Mn Fe Mo
## Min. :0.1000 Min. : 2.0 Min. : 0.010 Min. : 1.000
## 1st Qu.:0.1000 1st Qu.: 490.0 1st Qu.: 4.000 1st Qu.: 1.000
## Median :0.1000 Median : 820.0 Median : 5.100 Median : 1.000
## Mean :0.1146 Mean : 959.5 Mean : 5.168 Mean : 1.654
## 3rd Qu.:0.1000 3rd Qu.:1200.0 3rd Qu.: 6.200 3rd Qu.: 2.000
## Max. :7.9000 Max. :9700.0 Max. :24.000 Max. :46.000
##
## U W Sn Hg
## Min. :-1.00 Min. :-1.00 Min. : 1.000 Min. : 5
## 1st Qu.: 0.70 1st Qu.: 1.00 1st Qu.: 1.000 1st Qu.: 60
## Median : 1.10 Median : 1.00 Median : 1.000 Median : 80
## Mean : 1.36 Mean : 1.14 Mean : 1.123 Mean : 232
## 3rd Qu.: 1.70 3rd Qu.: 1.00 3rd Qu.: 1.000 3rd Qu.: 120
## Max. : 9.10 Max. :32.00 Max. :140.000 Max. :20000
##
## As Sb Ba Cd
## Min. : 1.00 Min. : 0.1000 Min. : 5 Min. : 0.100
## 1st Qu.: 5.00 1st Qu.: 0.1000 1st Qu.: 200 1st Qu.: 0.100
## Median : 6.00 Median : 0.1000 Median : 300 Median : 0.100
## Mean : 10.95 Mean : 0.2411 Mean : 301 Mean : 0.165
## 3rd Qu.: 10.00 3rd Qu.: 0.1000 3rd Qu.: 390 3rd Qu.: 0.100
## Max. :360.00 Max. :15.0000 Max. :1800 Max. :14.800
##
## V Bi Cr LoI
## Min. : 2.0 Min. :0.1000 Min. : 5.0 Min. :-1.00
## 1st Qu.:215.0 1st Qu.:0.1000 1st Qu.: 52.0 1st Qu.: 6.20
## Median :295.0 Median :0.1000 Median : 88.0 Median : 9.20
## Mean :318.3 Mean :0.1213 Mean :114.1 Mean :11.45
## 3rd Qu.:410.0 3rd Qu.:0.1000 3rd Qu.:148.0 3rd Qu.:14.00
## Max. :960.0 Max. :4.2000 Max. :860.0 Max. :82.80
##
## F Au
## Min. : 20.0 Min. : 1.00
## 1st Qu.:120.0 1st Qu.: 1.00
## Median :150.0 Median : 2.00
## Mean :164.2 Mean : 24.55
## 3rd Qu.:200.0 3rd Qu.: 5.00
## Max. :620.0 Max. :5800.00
##
str(ca_geo, 1)
## Formal class 'SpatialPointsDataFrame' [package "sp"] with 5 slots
# See what measurements are at each location
names(ca_geo)
## [1] "ID" "Elev" "pH" "Zn" "Cu" "Pb" "Ni" "Co" "Ag" "Mn"
## [11] "Fe" "Mo" "U" "W" "Sn" "Hg" "As" "Sb" "Ba" "Cd"
## [21] "V" "Bi" "Cr" "LoI" "F" "Au"
# Get a summary of the acidity (pH) values
summary(ca_geo$pH)
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 3.900 6.100 6.600 6.531 7.000 8.700 33
# Look at the distribution
hist(ca_geo$pH)
# Make a vector that is TRUE for the missing data
miss <- is.na(ca_geo$pH)
table(miss)
## miss
## FALSE TRUE
## 1107 33
# Plot a map of acidity
spplot(ca_geo[!miss, ], "pH")
# The acidity data shows pH broadly increasing from north-east to south-west. Fitting a linear model with the coordinates as covariates will interpolate a flat plane through the values
# ca_geo has been pre-defined
str(ca_geo, 1)
## Formal class 'SpatialPointsDataFrame' [package "sp"] with 5 slots
# Are they called lat-long, up-down, or what?
coordnames(ca_geo)
## [1] "x" "y"
# Complete the formula
m_trend <- lm(pH ~ x + y, as.data.frame(ca_geo))
# Check the coefficients
summary(m_trend)
##
## Call:
## lm(formula = pH ~ x + y, data = as.data.frame(ca_geo))
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.83561 -0.32091 -0.00761 0.33188 2.06249
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 8.358e+01 3.002e+00 27.84 <2e-16 ***
## x -5.691e-06 3.483e-07 -16.34 <2e-16 ***
## y -1.313e-05 5.319e-07 -24.68 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.5299 on 1104 degrees of freedom
## (33 observations deleted due to missingness)
## Multiple R-squared: 0.4237, Adjusted R-squared: 0.4227
## F-statistic: 405.9 on 2 and 1104 DF, p-value: < 2.2e-16
# Your next task is to compute the pH at the locations that have missing data in the source. You can use the predict() function on the fitted model from the previous exercise for this
# ca_geo, miss, m_trend have been pre-defined
# ls.str()
# Make a vector that is TRUE for the missing data
miss <- is.na(ca_geo$pH)
# Create a data frame of missing data
ca_geo_miss <- as.data.frame(ca_geo)[miss, ]
# Predict pH for the missing data
predictions <- predict(m_trend, newdata = ca_geo_miss, se.fit = TRUE)
# Compute the exceedence probability
pAlkaline <- 1 - pnorm(7, mean = predictions$fit, sd = predictions$se.fit)
hist(pAlkaline)
# You can use the gstat package to plot variogram clouds and the variograms from data. Recall:
# The variogram cloud shows the differences of the measurements against distance for all pairs of data points
# The binned variogram divides the cloud into distance bins and computes the average difference within each bin
# The y-range of the binned variogram is always much smaller than the variogram cloud because the cloud includes the full range of values that go into computing the mean for the binned variogram
# The acidity survey data, ca_geo and the missing value index, miss have been pre-defined
# The gstat variogram() function uses the cloud argument to plot a variogram cloud - the default cloud parameter is FALSE
# ca_geo, miss have been pre-defined
# ls.str()
# Make a cloud from the non-missing data up to 10km
plot(gstat::variogram(pH ~ 1, ca_geo[!miss, ], cloud = TRUE, cutoff = 10000))
# Make a variogram of the non-missing data
plot(gstat::variogram(pH ~ 1, ca_geo[!miss, ]))
# You might imagine that if soil at a particular point is alkaline, then soil one metre away is likely to be alkaline too
# But can you say the same thing about soil one kilometre away, or ten kilometres, or one hundred kilometres?
# The shape of the previous variogram tells you there is a large-scale trend in the data
# You can fit a variogram considering this trend with gstat
# This variogram should flatten out, indicating there is no more spatial correlation after a certain distance with the trend taken into account
# ca_geo, miss have been pre-defined
# ls.str()
# See what coordinates are called
coordnames(ca_geo)
## [1] "x" "y"
# The pH depends on the coordinates
ph_vgm <- gstat::variogram(pH ~ x + y, ca_geo[!miss, ])
plot(ph_vgm)
# Next you'll fit a model to your variogram
# The gstat function fit.variogram() does this
# You need to give it some initial values as a starting point for the optimization algorithm to fit a better model
# The sill is the the upper limit of the model
# That is, the long-range largest value, ignoring any outliers
# A variogram has been plotted for you, and ph_vgm has been pre-defined
# Estimate some parameters by eyeballing the plot
# The nugget is the value of the semivariance at zero distance.
# The partial sill, psill is the difference between the sill and the nugget.
# Set the range to the distance at which the variogram has got about half way between the nugget and the sill
# Fit a variogram model by calling fit.variogram()
# The second argument should take the parameters you estimated, wrapped in a call to vgm()
# ca_geo, miss, ph_vgm have been pre-defined
# ls.str()
# Eyeball the variogram and estimate the initial parameters
nugget <- 0.16
psill <- 0.15
range <- 10000
# Fit the variogram
v_model <- gstat::fit.variogram(
ph_vgm,
model = gstat::vgm(
model = "Ste",
nugget = nugget,
psill = psill,
range = range,
kappa = 0.5
)
)
# Show the fitted variogram on top of the binned variogram
plot(ph_vgm, model = v_model)
print(v_model)
## model psill range kappa
## 1 Nug 0.1545116 0.00 0.0
## 2 Ste 0.1442007 14379.29 0.5
# The final part of geostatical estimation is kriging itself
# This is the application of the variogram along with the sample data points to produce estimates and uncertainties at new locations
# The computation of estimates and uncertainties, together with the assumption of a normal (Gaussian) response means you can compute any function of the estimates - for example the probability of a new location having alkaline soil
# The acidity survey data, ca_geo, the missing value index, miss, and the variogram model, v_model, have been pre-defined
# ca_geo, miss, v_model have been pre-defined
# ls.str()
# Set the trend formula and the new data
km <- gstat::krige(pH ~ x + y, ca_geo[!miss, ], newdata = ca_geo[miss, ], model = v_model)
## [using universal kriging]
names(km)
## [1] "var1.pred" "var1.var"
# Plot the predicted values
spplot(km, "var1.pred")
# Compute the probability of alkaline samples, and map
km$pAlkaline <- 1 - pnorm(7, mean = km$var1.pred, sd = sqrt(km$var1.var))
spplot(km, "pAlkaline")
# You have been asked to produce an alkaline probability map over the study area
# To do this, you are going to do some kriging via the krige() function
# This requires a SpatialPixels object which will take a bit of data manipulation to create
# You start by defining a grid, creating points on that grid, cropping to the study region, and then finally converting to SpatialPixels
# On the way, you'll meet some new functions
# GridTopology() defines a rectangular grid. It takes three vectors of length two as inputs
# The first specifies the position of the bottom left corner of the grid
# The second specifies the width and height of each rectangle in the grid, and the third specifies the number of rectangles in each direction
# To ensure that the grid and the study area have the same coordinates, some housekeeping is involved
# SpatialPoints() converts the points to a coordinate reference system (CRS), or projection (different packages use different terminology for the same concept)
# The CRS is created by wrapping the study area in projection(), then in CRS()
# For the purpose of this exercise, you don't need to worry about exactly what these functions do, only that this data manipulation is necessary to align the grid and the study area
# Now that you have that alignment, crop(), as the name suggests, crops the grid to the study area
# Finally, SpatialPixels() converts the raster cropped gridpoints to the equivalent sp object
# The acidity survey data, ca_geo, the missing value index, miss, the variogram, vgm, and the variogram model, v_model, have been pre-defined
# A rough outline of the study area is in an object called geo_bounds
# ca_geo, geo_bounds have been pre-defined
# ls.str()
# Plot the polygon and points
geo_bounds <- readRDS("./RInputFiles/ca_geo_bounds.RDS")
plot(geo_bounds)
points(ca_geo)
# Find the corners of the boundary
bbox(geo_bounds)
## min max
## x 537853.4 719269.2
## y 5536290.4 5657400.9
# Define a 2.5km square grid over the polygon extent. The first parameter is
# the bottom left corner.
grid <- GridTopology(c(537853, 5536290), c(2500, 2500), c(72, 48))
# Create points with the same coordinate system as the boundary
gridpoints <- SpatialPoints(grid, proj4string = CRS(raster::projection(geo_bounds)))
plot(gridpoints)
# Crop out the points outside the boundary
cropped_gridpoints <- raster::crop(gridpoints, geo_bounds)
plot(cropped_gridpoints)
# Convert to SpatialPixels
spgrid <- SpatialPixels(cropped_gridpoints)
coordnames(spgrid) <- c("x", "y")
plot(spgrid)
# The spatial pixel grid of the region, spgrid, and the variogram model of pH, v_model have been pre-defined
# spgrid, v_model have been pre-defined
# ls.str()
# Do kriging predictions over the grid
ph_grid <- gstat::krige(pH ~ x + y, ca_geo[!miss, ], newdata = spgrid, model = v_model)
## [using universal kriging]
# Calc the probability of pH exceeding 7
ph_grid$pAlkaline <- 1 - pnorm(7, mean = ph_grid$var1.pred, sd = sqrt(ph_grid$var1.var))
# Map the probability of alkaline samples
spplot(ph_grid, zcol = "pAlkaline")
# The autoKrige() function in the automap package computes binned variograms, fits models, does model selection, and performs kriging by making multiple calls to the gstat functions you used previously
# It can be a great time-saver but you should always check the results carefully.
# autoKrige() can try several variogram model types
# In the example, you'll use a Matern variogram model, which is commonly used in soil and forestry analyses
# You can see a complete list of available models by calling vgm() with no arguments
# The acidity survey data, ca_geo, and the missing value index, miss, have been pre-defined
# ca_geo, miss are pre-defined
# ls.str()
# Kriging with linear trend, predicting over the missing points
ph_auto <- automap::autoKrige(
pH ~ x + y,
input_data = ca_geo[!miss, ],
new_data = ca_geo[miss, ],
model = "Mat"
)
## [using universal kriging]
# Plot the variogram, predictions, and standard error
plot(ph_auto)
# You can also use autoKrige() over the spgrid grid from the earlier exercise
# This brings together all the concepts that you've learned in the chapter
# That is, kriging is great for predicting missing data, plotting things on a grid is much clearer than plotting individual points, and automatic kriging is less hassle than manual kriging
# The acidity survey data, ca_geo, the missing value index, miss, the spatial pixel grid of the region, spgrid, the manual kriging grid model, ph_grid, and the variogram model of pH, v_model have been pre-defined
# ca_geo, miss, spgrid, ph_grid, v_model are pre-defined
# ls.str()
# Auto-run the kriging
ph_auto_grid <- automap::autoKrige(pH ~ x + y, input_data = ca_geo[!miss, ], new_data = spgrid)
## [using universal kriging]
# Remember predictions from manual kriging
plot(ph_grid)
# Plot predictions and variogram fit
plot(ph_auto_grid)
# Compare the variogram model to the earlier one
v_model
## model psill range kappa
## 1 Nug 0.1545116 0.00 0.0
## 2 Ste 0.1442007 14379.29 0.5
ph_auto_grid$var_model
## model psill range kappa
## 1 Nug 0.1846444 0.00 0
## 2 Ste 0.1092281 13085.13 2
Chapter 1 - Vector and Raster Spatial Data in R
Reading vector and raster data into R:
Getting to know your vector data:
Getting to know your raster data:
Example code includes:
# Load the sf package
library(sf)
## Linking to GEOS 3.6.1, GDAL 2.2.0, proj.4 4.9.3
# Read in the trees shapefile
trees <- st_read("./RInputFiles/ZIP Files/trees/trees.shp")
## Reading layer `trees' from data source `C:\Users\Dave\Documents\Personal\Learning\Coursera\RDirectory\RHomework\DataCamp\RInputFiles\ZIP Files\trees\trees.shp' using driver `ESRI Shapefile'
## Simple feature collection with 65217 features and 7 fields
## geometry type: POINT
## dimension: XY
## bbox: xmin: -74.2546 ymin: 40.49894 xmax: -73.70078 ymax: 40.91165
## epsg (SRID): 4326
## proj4string: +proj=longlat +ellps=WGS84 +no_defs
# Read in the neighborhood shapefile
neighborhoods <- st_read("./RInputFiles/ZIP Files/neighborhoods/neighborhoods.shp")
## Reading layer `neighborhoods' from data source `C:\Users\Dave\Documents\Personal\Learning\Coursera\RDirectory\RHomework\DataCamp\RInputFiles\ZIP Files\neighborhoods\neighborhoods.shp' using driver `ESRI Shapefile'
## Simple feature collection with 195 features and 5 fields
## geometry type: MULTIPOLYGON
## dimension: XY
## bbox: xmin: -74.25559 ymin: 40.49612 xmax: -73.70001 ymax: 40.91553
## epsg (SRID): 4326
## proj4string: +proj=longlat +ellps=WGS84 +no_defs
# Read in the parks shapefile
parks <- st_read("./RInputFiles/ZIP Files/parks/parks.shp")
## Reading layer `parks' from data source `C:\Users\Dave\Documents\Personal\Learning\Coursera\RDirectory\RHomework\DataCamp\RInputFiles\ZIP Files\parks\parks.shp' using driver `ESRI Shapefile'
## Simple feature collection with 2008 features and 14 fields
## geometry type: MULTIPOLYGON
## dimension: XY
## bbox: xmin: -74.25609 ymin: 40.49449 xmax: -73.70905 ymax: 40.91133
## epsg (SRID): 4326
## proj4string: +proj=longlat +datum=WGS84 +no_defs
# View the first few trees
head(trees)
## Simple feature collection with 6 features and 7 fields
## geometry type: POINT
## dimension: XY
## bbox: xmin: -74.13116 ymin: 40.62351 xmax: -73.80057 ymax: 40.77393
## epsg (SRID): 4326
## proj4string: +proj=longlat +ellps=WGS84 +no_defs
## tree_id nta longitude latitude stump_diam species boroname
## 1 558423 QN76 -73.80057 40.67035 0 honeylocust Queens
## 2 286191 MN32 -73.95422 40.77393 0 Callery pear Manhattan
## 3 257044 QN70 -73.92309 40.76196 0 Chinese elm Queens
## 4 603262 BK09 -73.99866 40.69312 0 cherry Brooklyn
## 5 41769 SI22 -74.11773 40.63166 0 cherry Staten Island
## 6 24024 SI07 -74.13116 40.62351 0 red maple Staten Island
## geometry
## 1 POINT (-73.80057 40.67035)
## 2 POINT (-73.95422 40.77393)
## 3 POINT (-73.92309 40.76196)
## 4 POINT (-73.99866 40.69312)
## 5 POINT (-74.11773 40.63166)
## 6 POINT (-74.13116 40.62351)
# The term "raster" refers to gridded data that can include satellite imagery, aerial photographs (like orthophotos) and other types
# In R, raster data can be handled using the raster package created by Robert J. Hijmans
# When working with raster data, one of the most important things to keep in mind is that the raw data can be what is known as "single-band" or "multi-band" and these are handled a little differently in R
# Single-band rasters are the simplest, these have a single layer of raster values -- a classic example would be an elevation raster where each cell value represents the elevation at that location
# Multi-band rasters will have more than one layer. An example is a color aerial photo in which there would be one band each representing red, green or blue light.
# Load the raster package
library(raster)
## Loading required package: sp
##
## Attaching package: 'raster'
## The following objects are masked from 'package:spatstat':
##
## area, rotate, shift
## The following object is masked from 'package:nlme':
##
## getData
## The following object is masked from 'package:dplyr':
##
## select
# Read in the tree canopy single-band raster
canopy <- raster("./RInputFiles/ZIP Files/canopy/canopy.tif")
# Read in the manhattan Landsat image multi-band raster
manhattan <- brick("./RInputFiles/ZIP Files/manhattan/manhattan.tif")
# Get the class for the new objects
class(canopy)
## [1] "RasterLayer"
## attr(,"package")
## [1] "raster"
class(manhattan)
## [1] "RasterBrick"
## attr(,"package")
## [1] "raster"
# Identify how many layers each object has
nlayers(canopy)
## [1] 1
nlayers(manhattan)
## [1] 3
# As mentioned in the video, spatial objects in sf are just data frames with some special properties
# This means that packages like dplyr can be used to manipulate sf objects
# In this exercise, you will use the dplyr functions select() to select or drop variables, filter() to filter the data and mutate() to add or alter columns
# Load the dplyr and sf packages
# library(dplyr)
# library(sf)
# Read in the trees shapefile (already read in above)
# trees <- st_read("trees.shp")
# Use filter() to limit to honey locust trees
honeylocust <- trees %>% filter(species == "honeylocust")
# Count the number of rows
nrow(honeylocust)
## [1] 6418
# Limit to tree_id and boroname variables
honeylocust_lim <- honeylocust %>% dplyr::select(tree_id, boroname)
# Use head() to look at the first few records
head(honeylocust_lim)
## Simple feature collection with 6 features and 2 fields
## geometry type: POINT
## dimension: XY
## bbox: xmin: -73.97524 ymin: 40.67035 xmax: -73.80057 ymax: 40.83136
## epsg (SRID): 4326
## proj4string: +proj=longlat +ellps=WGS84 +no_defs
## tree_id boroname geometry
## 1 558423 Queens POINT (-73.80057 40.67035)
## 2 548625 Brooklyn POINT (-73.97524 40.68575)
## 3 549439 Brooklyn POINT (-73.94137 40.67557)
## 4 181757 Brooklyn POINT (-73.89377 40.67169)
## 5 379387 Queens POINT (-73.8221 40.69365)
## 6 383562 Bronx POINT (-73.90041 40.83136)
# In this exercise, you will convert the data frame to what's called a tibble with tibble::as_tibble() (Note that dplyr::tbl_df() is now deprecated)
# tibble is loaded in your workspace
# Create a standard, non-spatial data frame with one column
df <- data.frame(a = 1:3)
# Add a list column to your data frame
df$b <- list(1:4, 1:5, 1:10)
# Look at your data frame with head
head(df)
## a b
## 1 1 1, 2, 3, 4
## 2 2 1, 2, 3, 4, 5
## 3 3 1, 2, 3, 4, 5, 6, 7, 8, 9, 10
# Convert your data frame to a tibble and print on console
as_tibble(df)
## # A tibble: 3 x 2
## a b
## <int> <list>
## 1 1 <int [4]>
## 2 2 <int [5]>
## 3 3 <int [10]>
# Pull out the third observation from both columns individually
df$a[3]
## [1] 3
df$b[3]
## [[1]]
## [1] 1 2 3 4 5 6 7 8 9 10
# There are several functions in sf that allow you to access geometric information like area from your vector features
# For example, the functions st_area() and st_length() return the area and length of your features, respectively
# Note that the result of functions like st_area() and st_length() will not be a traditional vector
# Instead the result has a class of units which means the vector result is accompanied by metadata describing the object's units
# you need to either remove the units with unclass()
# or you need to convert val's class to units such as with units(val) <- units(result)
# sf and dplyr are loaded in your workspace
# Read in the parks shapefile (already read in above)
# parks <- st_read("parks.shp")
# Compute the areas of the parks
areas <- st_area(parks)
# Create a quick histogram of the areas using hist
hist(areas, xlim = c(0, 200000), breaks = 1000)
# Filter to parks greater than 30000 (square meters)
big_parks <- parks %>% filter(unclass(areas) > 30000)
# Plot just the geometry of big_parks
plot(st_geometry(big_parks))
# Plot the parks object using all defaults
plot(parks)
## Warning: plotting the first 9 out of 14 attributes; use max.plot = 14 to
## plot all
# Plot just the acres attribute of the parks data
plot(parks["acres"])
# Create a new object of just the parks geometry
parks_geo <- st_geometry(parks)
# Plot the geometry of the parks data
plot(parks_geo)
# Load the raster package
# library(raster)
# Read in the rasters (done previously)
# canopy <- raster("canopy.tif")
# manhattan <- brick("manhattan.tif")
# Get the extent of the canopy object
extent(canopy)
## class : Extent
## xmin : 1793685
## xmax : 1869585
## ymin : 2141805
## ymax : 2210805
# Get the CRS of the manhattan object
crs(manhattan)
## CRS arguments:
## +proj=utm +zone=18 +datum=WGS84 +units=m +no_defs +ellps=WGS84
## +towgs84=0,0,0
# Determine the number of grid cells in both raster objects
ncell(manhattan)
## [1] 619173
ncell(canopy)
## [1] 58190
# Raster data can be very big depending on the extent and resolution (grid size)
# In order to deal with this the raster() and brick() functions are designed to only read in the actual raster values as needed
# To show that this is true, you can use the inMemory() function on an object and it will return FALSE if the values are not in memory
# If you use the head() function, the raster package will read in only the values needed, not the full set of values
# The raster values will be read in by default if you perform spatial analysis operations that require it or you can read in the values from a raster manually with the function getValues()
graphics.off()
# Check if the data is in memory
inMemory(canopy)
## [1] FALSE
# Use head() to peak at the first few records
head(canopy)
## 1 2 3 4 5 6 7 8 9 10 11 12
## 1 0.00 19.35 47.88 17.17 54.27 70.93 81.18 84.23 88.86 87.17 82.27 81.65
## 2 0.00 10.65 58.61 28.77 51.19 53.65 85.29 88.81 89.00 84.59 79.00 87.18
## 3 0.00 0.00 17.26 28.72 49.04 43.84 76.13 83.78 88.30 76.47 84.44 69.35
## 4 0.00 0.96 23.81 64.48 38.24 36.16 79.26 87.02 83.80 70.21 34.33 16.14
## 5 0.00 6.97 38.18 81.83 48.02 52.84 71.18 87.21 76.72 72.90 7.12 47.38
## 6 14.89 31.42 34.17 51.29 85.26 70.05 74.99 83.52 83.98 75.74 41.93 84.72
## 7 65.06 37.87 30.91 23.92 35.21 53.85 85.32 85.59 85.63 76.34 77.72 81.97
## 8 68.95 43.38 37.51 22.02 27.54 72.25 84.80 86.20 86.14 84.44 82.56 67.32
## 9 58.23 33.00 43.03 12.07 19.64 76.00 76.35 76.53 83.94 85.48 83.76 41.86
## 10 46.31 53.63 23.67 10.73 48.16 60.86 63.47 69.98 61.79 55.78 34.47 49.32
## 13 14 15 16 17 18 19 20
## 1 77.95 80.72 64.98 80.05 66.04 34.45 28.03 54.67
## 2 73.62 84.57 72.14 83.08 72.75 13.29 33.37 42.57
## 3 66.02 83.02 72.60 77.81 60.89 49.96 27.29 41.00
## 4 72.36 82.51 72.86 78.84 70.87 30.37 47.14 55.99
## 5 75.38 87.18 74.90 81.76 60.04 24.07 45.37 52.69
## 6 85.36 84.52 65.53 68.28 59.58 52.38 31.21 33.98
## 7 65.10 63.87 48.16 30.90 39.24 29.74 5.78 14.39
## 8 6.58 51.03 21.22 40.49 29.76 22.16 6.76 46.05
## 9 9.47 23.15 50.56 44.56 19.28 32.92 52.94 43.42
## 10 8.55 21.59 55.36 54.16 45.76 47.89 59.27 47.48
# Use getValues() to read the values into a vector
vals <- getValues(canopy)
# Use hist() to create a histogram of the values
hist(vals)
# The raster package has added useful methods for plotting both single and multi-band rasters
# For single-band rasters or for a map of each layer in a multi-band raster you can simply use plot()
# If you have a multi-band raster with layers for red, green and blue light you can use the plotRGB() function to plot the raster layers together as a single image
# Plot the canopy raster (single raster)
plot(canopy)
# Plot the manhattan raster (as a single image for each layer)
plot(manhattan)
# Plot the manhattan raster as an image
plotRGB(manhattan)
# raster masks dplyr::select
detach("package:raster")
Chapter 2 - Preparing Layers for Spatial Analysis
Quick refresher on coordinate reference systems (CRS):
Manipulating vector layers with dplyr:
Converting sf objects into sp objects and coordinates:
Manipulating raster layers:
Example code includes:
library(sf)
library(raster)
##
## Attaching package: 'raster'
## The following objects are masked from 'package:spatstat':
##
## area, rotate, shift
## The following object is masked from 'package:nlme':
##
## getData
## The following object is masked from 'package:dplyr':
##
## select
# In order to perform any spatial analysis with more than one layer, your layers should share the same coordinate reference system (CRS) and the first step is determining what coordinate reference system your data has
# To do this you can make use of the sf function st_crs() and the raster function crs()
# When the geographic data you read in with sf already has a CRS defined both sf and raster will recognize and retain it
# When the CRS is not defined you will need to define it yourself using either the EPSG number or the proj4string
# Determine the CRS for the neighborhoods and trees vector objects
st_crs(neighborhoods)
## Coordinate Reference System:
## EPSG: 4326
## proj4string: "+proj=longlat +ellps=WGS84 +no_defs"
st_crs(trees)
## Coordinate Reference System:
## EPSG: 4326
## proj4string: "+proj=longlat +ellps=WGS84 +no_defs"
# Assign the CRS to trees
crs_1 <- "+proj=longlat +ellps=WGS84 +no_defs"
st_crs(trees) <- crs_1
# Determine the CRS for the canopy and manhattan rasters
crs(canopy)
## CRS arguments:
## +proj=aea +lat_1=29.5 +lat_2=45.5 +lat_0=23 +lon_0=-96 +x_0=0
## +y_0=0 +ellps=GRS80 +towgs84=0,0,0,0,0,0,0 +units=m +no_defs
crs(manhattan)
## CRS arguments:
## +proj=utm +zone=18 +datum=WGS84 +units=m +no_defs +ellps=WGS84
## +towgs84=0,0,0
# Assign the CRS to manhattan
crs_2 <- "+proj=utm +zone=18 +ellps=GRS80 +datum=NAD83 +units=m +no_defs"
crs(manhattan) <- crs_2
# In this exercise you will transform (sometimes this is called "project") the objects so they share a single CRS
# It is generally best to perform spatial analysis with layers that have a projected CRS (and some functions require this)
# To determine if your object has a projected CRS you can look at the first part of the result from st_crs() or crs() -- if it begins with +proj=longlat then your CRS is unprojected
# Note that you will use method = "ngb" in your call to projectRaster() to prevent distortion in the manhattan image
# Get the CRS from the canopy object
the_crs <- crs(canopy, asText = TRUE)
# Project trees to match the CRS of canopy
trees_crs <- st_transform(trees, crs = the_crs)
# Project neighborhoods to match the CRS of canopy
neighborhoods_crs <- st_transform(neighborhoods, crs = the_crs)
# Project manhattan to match the CRS of canopy
manhattan_crs <- projectRaster(manhattan, crs = the_crs, method = "ngb")
# Look at the CRS to see if they match
st_crs(trees_crs)
## Coordinate Reference System:
## No EPSG code
## proj4string: "+proj=aea +lat_1=29.5 +lat_2=45.5 +lat_0=23 +lon_0=-96 +x_0=0 +y_0=0 +ellps=GRS80 +towgs84=0,0,0,0,0,0,0 +units=m +no_defs"
st_crs(neighborhoods_crs)
## Coordinate Reference System:
## No EPSG code
## proj4string: "+proj=aea +lat_1=29.5 +lat_2=45.5 +lat_0=23 +lon_0=-96 +x_0=0 +y_0=0 +ellps=GRS80 +towgs84=0,0,0,0,0,0,0 +units=m +no_defs"
crs(manhattan_crs)
## CRS arguments:
## +proj=aea +lat_1=29.5 +lat_2=45.5 +lat_0=23 +lon_0=-96 +x_0=0
## +y_0=0 +ellps=GRS80 +towgs84=0,0,0,0,0,0,0 +units=m +no_defs
# If the layers do not share a common CRS they may not align on a plot
# To illustrate, in this exercise, you will initially create a plot with the plot() function and try to add two layers that do not share the same CRS
# You will then transform one layer's CRS to match the other and you will plot this with both the plot() function and functions from the tmap package.
# Note that for this exercise we returned all the layers to their original CRS and did not retain the changes you made in the last exercise
# With the plot() function you can plot multiple layers on the same map by calling plot() multiple times
# You'll need to add the argument add = TRUE to all calls to plot() after the first one and you need to run the code for all layers at once rather than line-by-line
# Plot canopy and neighborhoods (run both lines together)
# Do you see the neighborhoods?
plot(canopy)
plot(neighborhoods$geometry, add = TRUE)
# See if canopy and neighborhoods share a CRS
st_crs(neighborhoods)
## Coordinate Reference System:
## EPSG: 4326
## proj4string: "+proj=longlat +ellps=WGS84 +no_defs"
crs(canopy)
## CRS arguments:
## +proj=aea +lat_1=29.5 +lat_2=45.5 +lat_0=23 +lon_0=-96 +x_0=0
## +y_0=0 +ellps=GRS80 +towgs84=0,0,0,0,0,0,0 +units=m +no_defs
# Save the CRS of the canopy layer
the_crs <- crs(canopy, asText = TRUE)
# Transform the neighborhoods CRS to match canopy
neighborhoods_crs <- st_transform(neighborhoods, crs=the_crs)
# Re-run plotting code (run both lines together)
# Do the neighborhoods show up now?
plot(canopy)
plot(neighborhoods_crs$geometry, add = TRUE)
# Simply run the tmap code
tmap::tm_shape(canopy) +
tmap::tm_rgb() +
tmap::tm_shape(neighborhoods_crs) +
tmap::tm_polygons(alpha = 0.5)
# One of the great innovations of sf over sp is the use of data frames for storing spatial objects
# This allows you to slice and dice your spatial data in the same way you do for non-spatial data
# This means you can, for example, apply dplyr verbs directly to your sf object
# One important difference between dplyr with and without spatial data is that the resulting data frames will include the geometry variable unless you explicitly drop it
# If you want to force the geometry to be dropped you would use the sf function st_set_geometry() and you would set the geometry to NULL
# The packages sf and dplyr, and the object trees are loaded in your workspace
# Create a data frame of counts by species
species_counts <- count(trees, species)
# Arrange in descending order
species_counts_desc <- arrange(species_counts, desc(n))
# Use head to see if the geometry column is in the data frame
head(species_counts_desc)
## Simple feature collection with 6 features and 2 fields
## geometry type: MULTIPOINT
## dimension: XY
## bbox: xmin: -74.25443 ymin: 40.49894 xmax: -73.70104 ymax: 40.91165
## epsg (SRID): 4326
## proj4string: +proj=longlat +ellps=WGS84 +no_defs
## # A tibble: 6 x 3
## species n geometry
## <fct> <int> <sf_geometry [degree]>
## 1 London planetree 8709 MULTIPOINT (-74.25408 40.50...
## 2 honeylocust 6418 MULTIPOINT (-74.25426 40.50...
## 3 Callery pear 5902 MULTIPOINT (-74.25443 40.50...
## 4 pin oak 5355 MULTIPOINT (-74.25329 40.50...
## 5 Norway maple 3373 MULTIPOINT (-74.25443 40.50...
## 6 littleleaf linden 3043 MULTIPOINT (-74.25032 40.51...
# Drop the geometry column
species_no_geometry <- st_set_geometry(species_counts_desc, NULL)
# Confirm the geometry column has been dropped
head(species_no_geometry)
## # A tibble: 6 x 2
## species n
## <fct> <int>
## 1 London planetree 8709
## 2 honeylocust 6418
## 3 Callery pear 5902
## 4 pin oak 5355
## 5 Norway maple 3373
## 6 littleleaf linden 3043
# In this exercise you will test joining spatial and non-spatial data. In particular, the trees data you have been working with has a full county name (the variable is called boroname) but does not have the county codes. The neighborhoods file has both a county name (the variable is called boro_name) and the county codes -- neighborhoods are nested within counties
# In this exercise, you will create a non-spatial data frame of county name and county code from the neighborhoods object
# Then you will join this data frame into the spatial trees object with inner_join()
# The packages sf and dplyr and the objects neighborhoods and trees are loaded in your workspace
# Limit to the fields boro_name, county_fip and boro_code
boro <- dplyr::select(neighborhoods, boro_name, county_fip, boro_code)
# Drop the geometry column
boro_no_geometry <- st_set_geometry(boro, NULL)
# Limit to distinct records
boro_distinct <- distinct(boro_no_geometry)
# Join the county detail into the trees object
trees_with_county <- inner_join(trees, boro_distinct, by = c("boroname" = "boro_name"))
# Confirm the new fields county_fip and boro_code exist
head(trees_with_county)
## Simple feature collection with 6 features and 9 fields
## geometry type: POINT
## dimension: XY
## bbox: xmin: -74.13116 ymin: 40.62351 xmax: -73.80057 ymax: 40.77393
## epsg (SRID): 4326
## proj4string: +proj=longlat +ellps=WGS84 +no_defs
## tree_id nta longitude latitude stump_diam species boroname
## 1 558423 QN76 -73.80057 40.67035 0 honeylocust Queens
## 2 286191 MN32 -73.95422 40.77393 0 Callery pear Manhattan
## 3 257044 QN70 -73.92309 40.76196 0 Chinese elm Queens
## 4 603262 BK09 -73.99866 40.69312 0 cherry Brooklyn
## 5 41769 SI22 -74.11773 40.63166 0 cherry Staten Island
## 6 24024 SI07 -74.13116 40.62351 0 red maple Staten Island
## county_fip boro_code geometry
## 1 081 4 POINT (-73.80057 40.67035)
## 2 061 1 POINT (-73.95422 40.77393)
## 3 081 4 POINT (-73.92309 40.76196)
## 4 047 3 POINT (-73.99866 40.69312)
## 5 085 5 POINT (-74.11773 40.63166)
## 6 085 5 POINT (-74.13116 40.62351)
# In sf you can use the st_simplify() function to reduce line and polygon complexity
# In this exercise you will measure the size of objects before and after st_simplify() in two ways
# You will compute the size in megabytes using the handy object_size() function in the pryr package and you will count the number of vertices -- the number of points required to delineate a line or polygon
# The packages sf and pryr are loaded in your workspace
# Plot the neighborhoods geometry
plot(st_geometry(neighborhoods), col = "grey")
# Measure the size of the neighborhoods object
utils::object.size(neighborhoods)
## 1890408 bytes
# Compute the number of vertices in the neighborhoods object
pts_neighborhoods <- st_cast(neighborhoods$geometry, "MULTIPOINT")
cnt_neighborhoods <- sapply(pts_neighborhoods, length)
sum(cnt_neighborhoods)
## [1] 210736
# Simplify the neighborhoods object
neighborhoods_simple <- st_simplify(neighborhoods,
preserveTopology = TRUE,
dTolerance = 0.0025)
## Warning in st_simplify.sfc(st_geometry(x), preserveTopology, dTolerance):
## st_simplify does not correctly simplify longitude/latitude data, dTolerance
## needs to be in decimal degrees
# Measure the size of the neighborhoods_simple object
utils::object.size(neighborhoods_simple)
## 248464 bytes
# Compute the number of vertices in the neighborhoods_simple object
pts_neighborhoods_simple <- st_cast(neighborhoods_simple$geometry, "MULTIPOINT")
cnt_neighborhoods_simple <- sapply(pts_neighborhoods_simple, length)
sum(cnt_neighborhoods_simple)
## [1] 4766
# Plot the neighborhoods_simple object geometry
plot(st_geometry(neighborhoods_simple), col = "grey")
# Read in the trees data (done previously)
# trees <- st_read("trees.shp")
# Convert to Spatial class
trees_sp <- as(trees, Class = "Spatial")
# Confirm conversion, should be "SpatialPointsDataFrame"
class(trees_sp)
## [1] "SpatialPointsDataFrame"
## attr(,"package")
## [1] "sp"
# Convert back to sf
trees_sf <- st_as_sf(trees_sp)
# Confirm conversion
class(trees_sf)
## [1] "sf" "data.frame"
# In order to convert a data frame of coordinates into an sf object you can make use of the st_as_sf() function you used in the previous exercise
# You can specify the coords argument with the names of the coordinate variables (with the X coordinate/longitude coordinate listed first) and, optionally, the crs argument if you know the CRS of your coordinates
# The CRS can be specified as a proj4 string or EPSG code
# If you want to convert your sf point objects to a data frame with coordinates, you can use the st_write() function with a
# hidden argument (these are arguments associated with an external utility called GDAL and so they're not in the R help) to force sf to include the coordinates in the output file
# The argument you need is layer_options = "GEOMETRY=AS_XY"
# Read in the CSV (done previously)
# trees <- read.csv("trees.csv")
# Convert the data frame to an sf object
trees_sf <- st_as_sf(trees, coords = c("longitude", "latitude"), crs = 4326)
# Plot the geometry of the points
plot(st_geometry(trees_sf))
# Write the file out with coordinates
st_write(trees_sf, "./RInputFiles/new_trees.csv", layer_options = "GEOMETRY=AS_XY", delete_dsn = TRUE)
## Deleting source `C:\Users\Dave\Documents\Personal\Learning\Coursera\RDirectory\RHomework\DataCamp\RInputFiles\new_trees.csv' using driver `CSV'
## Writing layer `new_trees' to data source `C:\Users\Dave\Documents\Personal\Learning\Coursera\RDirectory\RHomework\DataCamp\RInputFiles\new_trees.csv' using driver `CSV'
## options: GEOMETRY=AS_XY
## features: 65217
## fields: 7
## geometry type: Point
# Read in the file you just created and check coordinates
new_trees <- read.csv("./RInputFiles/new_trees.csv")
head(new_trees)
## X Y tree_id nta longitude latitude stump_diam
## 1 -73.80057 40.67035 558423 QN76 -73.80057 40.67035 0
## 2 -73.95422 40.77393 286191 MN32 -73.95422 40.77393 0
## 3 -73.92309 40.76196 257044 QN70 -73.92309 40.76196 0
## 4 -73.99866 40.69312 603262 BK09 -73.99866 40.69312 0
## 5 -74.11773 40.63166 41769 SI22 -74.11773 40.63166 0
## 6 -74.13116 40.62351 24024 SI07 -74.13116 40.62351 0
## species boroname
## 1 honeylocust Queens
## 2 Callery pear Manhattan
## 3 Chinese elm Queens
## 4 cherry Brooklyn
## 5 cherry Staten Island
## 6 red maple Staten Island
# Read in the canopy layer (done previously)
# canopy <- raster("canopy.tif")
# Plot the canopy raster
plot(canopy)
# Determine the raster resolution
res(canopy)
## [1] 300 300
# Determine the number of cells
ncell(canopy)
## [1] 58190
# Aggregate the raster
canopy_small <- aggregate(canopy, fact = 10)
# Plot the new canopy layer
plot(canopy_small)
# Determine the new raster resolution
res(canopy_small)
## [1] 3000 3000
# Determine the number of cells in the new raster
ncell(canopy_small)
## [1] 598
# Plot the canopy layer to see the values above 100
plot(canopy)
# Set up the matrix
vals <- cbind(100, 300, NA)
# Reclassify
canopy_reclass <- reclassify(canopy, rcl = vals)
# Plot again and confirm that the legend stops at 100
plot(canopy_reclass)
# raster masks dplyr::select
detach("package:raster")
Chapter 3 - Conducting Spatial Analysis with sf and raster
Buffers and centroids:
Bounding boxes, dissolve features and create a convex hull:
Multi-layer geoprocessing and relationships:
Geoprocessing with rasters:
Example code includes:
library(raster)
##
## Attaching package: 'raster'
## The following objects are masked from 'package:spatstat':
##
## area, rotate, shift
## The following object is masked from 'package:nlme':
##
## getData
## The following object is masked from 'package:dplyr':
##
## select
# Computing buffers is a key spatial analysis skill and the resulting buffers have a wide range of uses like, for example, identifying the number of roads within one kilometer of a school
# or computing the number of hazardous waste sites near sensitive natural areas
# Although, technically you can buffer data with unprojected coodinate reference systems, the buffer distance will be more meaningful with a projected CRS
# so it is highly recommended that you transform unprojected data to a projected CRS before buffering
df <- data.frame(place=c("Empire State Building", "Museum of Natural History"),
longitude=c(-73.98566, -73.97398),
latitude=c(40.74844, 40.78132),
stringsAsFactors = TRUE
)
# Review df
df
## place longitude latitude
## 1 Empire State Building -73.98566 40.74844
## 2 Museum of Natural History -73.97398 40.78132
# Convert the data frame to an sf object
df_sf <- st_as_sf(df, coords = c("longitude", "latitude"), crs=4326)
# Transform the points to match the manhattan CRS
df_crs <- st_transform(df_sf, crs = crs(manhattan, asText = TRUE))
# Buffer the points
df_buf <- st_buffer(df_crs, dist = 1000)
# Plot the manhattan image (it is multi-band)
plotRGB(manhattan)
plot(st_geometry(df_buf), col = "firebrick", add = TRUE)
plot(st_geometry(df_crs), pch = 16, add = TRUE)
# Similar to buffering, computing polygon centroids is a bedrock geoprocessing task used to assign values and even to help with labeling maps. The function for this in sf is st_centroid()
# Also similar to buffering, centroid calculations should generally be performed on data with a projected coordinate reference system
# Read in the neighborhods shapefile (done previously)
# neighborhoods <- st_read("neighborhoods.shp")
# Project neighborhoods to match manhattan
neighborhoods_tf <- st_transform(neighborhoods, crs = 32618)
# Compute the neighborhood centroids
centroids <- st_centroid(neighborhoods_tf)
# Plot the neighborhood geometry
plot(st_geometry(neighborhoods_tf), col = "grey", border = "white")
plot(centroids$geometry, pch = 16, col = "firebrick", add = TRUE)
# You can compute bounding boxes around vector data using sf
# These can help you, for example, create polygons to clip layers to a common area for an analysis or identify regions of influence
# In the sf package, there is a function for extracting the bounding box coordinates, if that's all you need, this is st_bbox()
# More likely you'll want to create a new sf object (a polygon) from those coordinates and to do this sf provides the st_make_grid() function
# st_make_grid() can be used to make a multi-row and multi-column grid covering your input data but it can also be used to make a grid of just one cell (a bounding box)
# To do this, you need to specify the number of grid cells as n = 1
# Use filter() to limit to honey locust trees
beech <- trees %>% filter(species %in% c("European beech", "American beech"))
str(beech)
## Classes 'sf' and 'data.frame': 27 obs. of 8 variables:
## $ tree_id : num 182961 163271 221707 16250 183728 ...
## $ nta : Factor w/ 188 levels "BK09","BK17",..: 169 107 119 58 180 146 106 135 169 171 ...
## $ longitude : num -73.9 -74 -73.8 -73.8 -74.1 ...
## $ latitude : num 40.8 40.8 40.7 40.8 40.6 ...
## $ stump_diam: num 0 0 0 0 0 0 0 0 0 0 ...
## $ species : Factor w/ 132 levels "'Schubert' chokecherry",..: 50 50 2 50 2 2 2 50 2 2 ...
## $ boroname : Factor w/ 5 levels "Bronx","Brooklyn",..: 4 3 4 1 5 4 3 4 4 5 ...
## $ geometry :sfc_POINT of length 27; first list element: Classes 'XY', 'POINT', 'sfg' num [1:2] -73.9 40.8
## - attr(*, "sf_column")= chr "geometry"
## - attr(*, "agr")= Factor w/ 3 levels "constant","aggregate",..: NA NA NA NA NA NA NA
## ..- attr(*, "names")= chr "tree_id" "nta" "longitude" "latitude" ...
# Plot the neighborhoods and beech trees
plot(st_geometry(neighborhoods), col = "grey", border = "white")
plot(beech$geometry, add = TRUE, pch = 16, col = "forestgreen")
# Compute the coordinates of the bounding box
st_bbox(beech)
## xmin ymin xmax ymax
## -74.17746 40.54247 -73.70872 40.85696
# Create a bounding box polygon
beech_box <- st_make_grid(beech, n = 1)
# Plot the neighborhoods, add the beech trees and add the new box
plot(st_geometry(neighborhoods), col = "grey", border = "white")
plot(beech$geometry, add = TRUE, pch = 16, col = "forestgreen")
plot(beech_box, add = TRUE)
# In order to compute a tighter bounding box, a convex hull, around a set of points like the beech trees from the previous exercise you'll need to learn one more function first
# For points you don't want a convex hull around each point! This doesn't even make sense
# More likely you want to compute a convex hull around all your points
# If you have a set of points and you want to draw a convex hull around them you first need to bundle the points into a single MULTIPOINT feature and in order to do this you will use the dissolve function in sf called st_union()
# With polygons, st_union() will dissolve all the polygons into a single polygon representing the area where all the polygons overlap
# Your set of individual points will be dissolved/unioned into a single, MULTIPOINT feature that you can use for tasks like computing the convex hull
# Buffer the beech trees by 3000
beech_buffer <- st_buffer(beech, 0.025)
## Warning in st_buffer.sfc(st_geometry(x), dist, nQuadSegs): st_buffer does
## not correctly buffer longitude/latitude data
## dist is assumed to be in decimal degrees (arc_degrees).
# Limit the object to just geometry
beech_buffers <- st_geometry(beech_buffer)
# Compute the number of features in beech_buffer
length(beech_buffers)
## [1] 27
# Plot the tree buffers
plot(beech_buffers)
# Dissolve the buffers
beech_buf_union <- st_union(beech_buffers)
# Compute the number of features in beech_buf_union
length(beech_buf_union)
## [1] 1
# Plot the dissolved buffers
plot(beech_buf_union)
# A more precise bounding polygon is sometimes needed, one that fits your data more neatly
# For this, you can use the st_convex_hull() function
# Note that st_convex_hull() will compute a tight box around each one of your features individually so if you want to create a convex hull around a
# group of features you'll need to use st_union() to combine individual features into a single multi-feature
# Look at the data frame to see the type of geometry
head(beech)
## Simple feature collection with 6 features and 7 fields
## geometry type: POINT
## dimension: XY
## bbox: xmin: -74.12843 ymin: 40.56829 xmax: -73.71567 ymax: 40.84684
## epsg (SRID): 4326
## proj4string: +proj=longlat +ellps=WGS84 +no_defs
## tree_id nta longitude latitude stump_diam species boroname
## 1 182961 QN72 -73.90401 40.76897 0 European beech Queens
## 2 163271 MN31 -73.95473 40.77224 0 European beech Manhattan
## 3 221707 QN06 -73.79034 40.72523 0 American beech Queens
## 4 16250 BX10 -73.78911 40.84684 0 European beech Bronx
## 5 183728 SI25 -74.12843 40.56829 0 American beech Staten Island
## 6 591657 QN43 -73.71567 40.73604 0 American beech Queens
## geometry
## 1 POINT (-73.90401 40.76897)
## 2 POINT (-73.95473 40.77224)
## 3 POINT (-73.79034 40.72523)
## 4 POINT (-73.78911 40.84684)
## 5 POINT (-74.12843 40.56829)
## 6 POINT (-73.71567 40.73604)
# Convert the points to a single multi-point
beech1 <- st_union(beech)
# Look at the data frame to see the type of geometry
head(beech1)
## Geometry set for 1 feature
## geometry type: MULTIPOINT
## dimension: XY
## bbox: xmin: -74.17746 ymin: 40.54247 xmax: -73.70872 ymax: 40.85696
## epsg (SRID): 4326
## proj4string: +proj=longlat +ellps=WGS84 +no_defs
## MULTIPOINT (-74.17746 40.54247, -74.13941 40.61...
# Confirm that we went from 17 features to 1 feature
length(beech)
## [1] 8
length(beech1)
## [1] 1
# Compute the tight bounding box
beech_hull <- st_convex_hull(beech1)
# Plot the points together with the hull
plot(beech_hull, col = "red")
plot(beech1, add = TRUE)
# For many analysis types you need to link geographies spatially
# For example, you want to know how many trees are in each neighborhood but you don't have a neighborhood attribute in the tree data
# The best way to do this is with a spatial join using st_join()
# Importantly, the st_join() function requires sf data frames as input and will not accept an object that is just sf geometry
# You can use the st_sf() function to convert sf geometry objects to an sf data frame (st_sf() is essentially the opposite of st_geometry())
# Plot the beech on top of the neighborhoods
plot(st_geometry(neighborhoods))
plot(beech$geometry, add = TRUE, pch = 16, col = "red")
# Determine whether beech has class data.frame
class(beech)
## [1] "sf" "data.frame"
# Convert the beech geometry to a sf data frame
beech_df <- st_sf(beech)
# Confirm that beech now has the data.frame class
class(beech_df)
## [1] "sf" "data.frame"
# Join the beech trees with the neighborhoods
beech_neigh <- st_join(beech_df, neighborhoods)
## although coordinates are longitude/latitude, st_intersects assumes that they are planar
# Confirm that beech_neigh has the neighborhood information
head(beech_neigh)
## Simple feature collection with 6 features and 12 fields
## geometry type: POINT
## dimension: XY
## bbox: xmin: -74.12843 ymin: 40.56829 xmax: -73.71567 ymax: 40.84684
## epsg (SRID): 4326
## proj4string: +proj=longlat +ellps=WGS84 +no_defs
## tree_id nta longitude latitude stump_diam species boroname
## 1 182961 QN72 -73.90401 40.76897 0 European beech Queens
## 2 163271 MN31 -73.95473 40.77224 0 European beech Manhattan
## 3 221707 QN06 -73.79034 40.72523 0 American beech Queens
## 4 16250 BX10 -73.78911 40.84684 0 European beech Bronx
## 5 183728 SI25 -74.12843 40.56829 0 American beech Staten Island
## 6 591657 QN43 -73.71567 40.73604 0 American beech Queens
## county_fip boro_name ntacode ntaname
## 1 081 Queens QN72 Steinway
## 2 061 Manhattan MN31 Lenox Hill-Roosevelt Island
## 3 081 Queens QN06 Jamaica Estates-Holliswood
## 4 005 Bronx BX10 Pelham Bay-Country Club-City Island
## 5 085 Staten Island SI25 Oakwood-Oakwood Beach
## 6 081 Queens QN43 Bellerose
## boro_code geometry
## 1 4 POINT (-73.90401 40.76897)
## 2 1 POINT (-73.95473 40.77224)
## 3 4 POINT (-73.79034 40.72523)
## 4 2 POINT (-73.78911 40.84684)
## 5 5 POINT (-74.12843 40.56829)
## 6 4 POINT (-73.71567 40.73604)
# In this exercise you will determine which neighborhoods are at least partly within 2000 meters of the Empire State Building with st_intersects()
# and those that are completely within 2000 meters of the Empire State Building using st_contains()
# You will then use the st_intersection() function (notice the slight difference in function name!) to clip the neighborhoods to the buffer
# A note about the output of functions that test relationships between two sets of features
# The output of these and related functions is a special kind of list (with the class sgbp)
# For example, when using st_intersects(), the first element in the output can be accessed using [[1]], which shows polygons from the second polygon that intersect with the first polygon
# Likewise, [[2]] would show the polygons from from the first polygon that intersect with the second polygon
# Review df
df
## place longitude latitude
## 1 Empire State Building -73.98566 40.74844
## 2 Museum of Natural History -73.97398 40.78132
df_mod <- df %>% filter(place == "Empire State Building")
df_sf_mod <- st_as_sf(df_mod, coords = c("longitude", "latitude"), crs=4326)
df_crs_mod <- st_transform(df_sf_mod, crs = crs(manhattan, asText = TRUE))
buf_mod <- st_buffer(df_crs_mod, dist = 2000)
buf <- st_transform(buf_mod, "+proj=longlat +ellps=WGS84 +no_defs")
# Identify neighborhoods that intersect with the buffer
neighborhoods_int <- st_intersects(buf, neighborhoods)
## although coordinates are longitude/latitude, st_intersects assumes that they are planar
# Identify neighborhoods contained by the buffer
neighborhoods_cont <- st_contains(buf, neighborhoods)
## although coordinates are longitude/latitude, st_contains assumes that they are planar
# Get the indexes of which neighborhoods intersect
# and are contained by the buffer
int <- neighborhoods_int[[1]]
cont <- neighborhoods_cont[[1]]
# Get the names of the names of neighborhoods in buffer
neighborhoods$ntaname[int]
## [1] Clinton
## [2] Midtown-Midtown South
## [3] Turtle Bay-East Midtown
## [4] Murray Hill-Kips Bay
## [5] Gramercy
## [6] Hudson Yards-Chelsea-Flatiron-Union Square
## [7] West Village
## [8] Stuyvesant Town-Cooper Village
## [9] East Village
## 195 Levels: Airport ... Yorkville
# Clip the neighborhood layer by the buffer (ignore the warning)
neighborhoods_clip <- st_intersection(buf, neighborhoods)
## although coordinates are longitude/latitude, st_intersection assumes that they are planar
## Warning: attribute variables are assumed to be spatially constant
## throughout all geometries
# Plot the geometry of the clipped neighborhoods
plot(st_geometry(neighborhoods_clip), col = "red")
plot(neighborhoods[cont,]$geometry, add = TRUE, col = "yellow")
# Of course, measuring distance between feature sets is a component of spatial analysis 101 -- a core skill for any analyst
# There are several functions in base R as well as in the packages rgeos and geosphere to compute distances, but the st_distance() function from sf
# provides a useful feature-to-feature distance matrix as output and can be used for most distance calculation needs
# In this exercise you'll measure the distance from the Empire State Building to all the parks and identify the closest one
# Read in the parks object (done previously)
# parks <- st_read("parks.shp")
empire_state <- df_crs_mod
str(empire_state)
## Classes 'sf' and 'data.frame': 1 obs. of 2 variables:
## $ place : Factor w/ 2 levels "Empire State Building",..: 1
## $ geometry:sfc_POINT of length 1; first list element: Classes 'XY', 'POINT', 'sfg' num [1:2] 585632 4511327
## - attr(*, "sf_column")= chr "geometry"
## - attr(*, "agr")= Factor w/ 3 levels "constant","aggregate",..: NA
## ..- attr(*, "names")= chr "place"
# Test whether the CRS match
st_crs(empire_state) == st_crs(parks)
## [1] FALSE
# Project parks to match empire state
parks_es <- st_transform(parks, crs = st_crs(empire_state))
# Compute the distance between empire_state and parks_es
d <- st_distance(empire_state, parks_es)
# Take a quick look at the result
head(d)
## Units: m
## [1] 3055.791 19969.336 22737.903 13846.358 4604.069 18541.779
# Find the index of the nearest park
nearest <- which.min(d)
# Identify the park that is nearest
parks_es[nearest, ]
## Simple feature collection with 1 feature and 14 fields
## geometry type: MULTIPOLYGON
## dimension: XY
## bbox: xmin: 585392 ymin: 4511320 xmax: 585418.2 ymax: 4511379
## epsg (SRID): 26918
## proj4string: +proj=utm +zone=18 +ellps=GRS80 +towgs84=0,0,0,0,0,0,0 +units=m +no_defs
## location communityb
## 188 Broadway, Av of the Americas, bet. W. 32 St. and W. 33 St. 105
## nys_senate signname zipcode us_congres gispropnum borough
## 188 27 Greeley Square Park 10001 12 M032 M
## waterfront nys_assemb councildis acres typecatego address
## 188 No 75 3 0.144 Triangle/Plaza 894 6 AVENUE
## geometry
## 188 MULTIPOLYGON (((585411 4511...
# Mask and crop are similar operations that allow you to limit your raster to a specific area of interest
# With mask() you essentially place your area of interest on top of the raster and any raster cells outside of the boundary are assigned NA values
# A reminder that currently the raster package does not support sf objects so they will need to be converted to Spatial objects with, for example, as(input, "Spatial").
# Project parks to match canopy
parks_cp <- st_transform(parks, crs = crs(canopy, asText = TRUE))
# Compute the area of the parks
areas <- st_area(parks_cp)
# Filter to parks with areas > 30000
parks_big <- filter(parks_cp, unclass(areas) > 30000)
# Plot the canopy raster
plot(canopy)
# Plot the geometry of parks_big
plot(st_geometry(parks_big))
# Convert parks to a Spatial object
parks_sp <- as(parks_big, "Spatial")
# Mask the canopy layer with parks_sp and save as canopy_mask
canopy_mask <- mask(canopy, mask = parks_sp)
# Plot canopy_mask -- this is a raster!
plot(canopy_mask)
# As you saw in the previous exercise with mask(), the raster extent is not changed
# If the extents of the input raster and the mask itself are different then they will still be different after running mask()
# In many cases, however, you will want your raster to share an extent with another layer and this is where crop() comes in handy
# With crop() you are cropping the raster so that the extent (the bounding box) of the raster matches the extent of the input crop layer
# But within the bounding box no masking is done (no raster cells are set to NA)
# In this exercise you will both mask and crop the NYC canopy layer based on the large parks and you'll compare
# You should notice that the masked raster includes a lot of NA values (there are the whitespace) and that the extent is the same as the original canopy layer
# With the cropped layer you should notice that the extent of the cropped canopy layer matches the extent of the large parks (essentially it's zoomed in)
# Convert the parks_big layer (this is preloaded, it has been limited to large parks and projected) to a Spatial object with as() -- call this parks_sp
# Convert the parks_big to a Spatial object
parks_sp <- as(parks_big, "Spatial")
# Mask the canopy with the large parks
canopy_mask <- mask(canopy, mask = parks_sp)
# Plot the mask
plot(canopy_mask)
# Crop canopy with parks_sp
canopy_crop <- crop(canopy, parks_sp)
# Plot the cropped version and compare
plot(canopy_crop)
# Beyond simply masking and cropping you may want to know the actual cell values at locations of interest
# You might, for example, want to know the percentage canopy at your landmarks or within the large parks
# This is where the extract() function comes in handy
# Usefully, and you'll see this in a later analysis, you can feed extract() a function that will get applied to extracted cells
# For example, you can use extract() to extract raster values by neighborhood and with the fun = mean argument it will return an average cell value by neighborhood
# Similar to other raster functions, it is not yet set up to accept sf objects so you'll need to convert to a Spatial object
# Project the landmarks to match canopy
# landmarks_cp <- st_transform(landmarks, crs = crs(canopy, asText = TRUE))
# Convert the landmarks to a Spatial object
# landmarks_sp <- as(landmarks_cp, "Spatial")
# Extract the canopy values at the landmarks
# landmarks_ex <- extract(canopy, landmarks_sp)
# Look at the landmarks and extraction results
# landmarks_cp
# landmarks_ex
# You will now use the canopy layer and an "imperviousness" layer from the same source, the United States Geological Survey
# Imperviousness measures whether water can pass through a surface
# So a high percentage impervious surface might be a road that does not let water pass through while a low percentage impervious might be something like a lawn
# What you will do in this exercise is essentially identify the most urban locations by finding areas that have both a low percentage of tree canopy (< 20%) and high percentage of impervious (> 80%)
# To do this, we defined the function f to do the raster math for you
# Read in the canopy (already read in) and impervious layer
# canopy <- raster("canopy.tif")
impervious <- raster("./RInputFiles/ZIP Files/impervious/impervious.tif")
# Function f with 2 arguments and the raster math code
f <- function(rast1, rast2) {
rast1 < 20 & rast2 > 80
}
# Do the overlay using f as fun
canopy_imperv_overlay <- overlay(canopy, impervious, fun = f)
# Plot the result (low tree canopy and high impervious areas)
plot(canopy_imperv_overlay)
# raster masks dplyr::select
detach("package:raster")
Chapter 4 - Combine Skills in Mini-Analysis
Compute tree density and average tree canopy by neighborhood:
First look at results with ggplot2:
Create final, polished maps with tmap:
Wrap up:
Example code includes:
library(raster)
##
## Attaching package: 'raster'
## The following objects are masked from 'package:spatstat':
##
## area, rotate, shift
## The following object is masked from 'package:nlme':
##
## getData
## The following object is masked from 'package:dplyr':
##
## select
# In order to compute tree density by neighborhood you need two things
# You will need to know the area of the neighborhoods, which you will compute in the next exercise
# And you need the tree counts by neighborhood which is the focus of this exercise
# You will produce counts of all trees by neighborhood in NYC and create a single data frame with a column for total trees
# The result should be a data frame with no geometry
# sf and dplyr are loaded in the workspace
# Compute the counts of all trees by hood (nta)
tree_counts <- count(trees, nta)
# Take a quick look
head(tree_counts)
## Simple feature collection with 6 features and 2 fields
## geometry type: MULTIPOINT
## dimension: XY
## bbox: xmin: -74.00396 ymin: 40.57265 xmax: -73.92026 ymax: 40.70249
## epsg (SRID): 4326
## proj4string: +proj=longlat +ellps=WGS84 +no_defs
## # A tibble: 6 x 3
## nta n geometry
## <fct> <int> <sf_geometry [degree]>
## 1 BK09 174 MULTIPOINT (-73.99901 40.69...
## 2 BK17 499 MULTIPOINT (-73.95982 40.58...
## 3 BK19 132 MULTIPOINT (-73.97331 40.57...
## 4 BK21 136 MULTIPOINT (-74.00396 40.58...
## 5 BK23 53 MULTIPOINT (-73.98038 40.57...
## 6 BK25 396 MULTIPOINT (-73.9718 40.607...
# Remove the geometry
tree_counts_no_geom <- st_set_geometry(tree_counts, NULL)
# Rename the n variable to tree_cnt
tree_counts_renamed <- rename(tree_counts_no_geom, tree_cnt = n)
# Create histograms of the total counts
hist(tree_counts_renamed$tree_cnt)
# We have the tree counts (from the previous exercise)
# In this exercise you will compute neighborhood areas, add them to the neighborhood sf object and then you'll join in the non-spatial tree counts data frame from the previous exercise
# Compute areas and unclass
areas <- unclass(st_area(neighborhoods))
# Add the areas to the neighborhoods object
neighborhoods_area <- mutate(neighborhoods, area = areas)
# Join neighborhoods and counts
neighborhoods_counts <- left_join(neighborhoods_area, tree_counts_renamed, by = c("ntacode"="nta"))
## Warning: Column `ntacode`/`nta` joining factors with different levels,
## coercing to character vector
# Replace NA values with 0
neighborhoods_counts <- mutate(neighborhoods_counts,
tree_cnt = ifelse(is.na(tree_cnt),
0, tree_cnt))
# Compute the density
neighborhoods_counts <- mutate(neighborhoods_counts,
tree_density = tree_cnt/area)
# In the previous exercises you computed tree density by neighborhood using tree counts
# In this exercise you will compute average tree canopy by neighborhood as a percentage so that we can compare if the results are similar
# Confirm that you have the neighborhood density results
head(neighborhoods_counts)
## Simple feature collection with 6 features and 8 fields
## geometry type: MULTIPOLYGON
## dimension: XY
## bbox: xmin: -74.00736 ymin: 40.61264 xmax: -73.77574 ymax: 40.8355
## epsg (SRID): 4326
## proj4string: +proj=longlat +ellps=WGS84 +no_defs
## county_fip boro_name ntacode ntaname boro_code area
## 1 047 Brooklyn BK88 Borough Park 3 5017229
## 2 081 Queens QN52 East Flushing 4 2736433
## 3 081 Queens QN48 Auburndale 4 3173995
## 4 081 Queens QN51 Murray Hill 4 4876380
## 5 081 Queens QN27 East Elmhurst 4 1832715
## 6 005 Bronx BX35 Morrisania-Melrose 2 1569317
## tree_cnt tree_density geometry
## 1 565 0.0001126120 MULTIPOLYGON (((-73.97605 4...
## 2 295 0.0001078046 MULTIPOLYGON (((-73.79493 4...
## 3 507 0.0001597356 MULTIPOLYGON (((-73.77574 4...
## 4 732 0.0001501114 MULTIPOLYGON (((-73.80379 4...
## 5 211 0.0001151297 MULTIPOLYGON (((-73.8611 40...
## 6 214 0.0001363650 MULTIPOLYGON (((-73.89697 4...
# Transform the neighborhoods CRS to match the canopy layer
neighborhoods_crs <- st_transform(neighborhoods_counts, crs = crs(canopy, asText = TRUE))
# Convert neighborhoods object to a Spatial object
neighborhoods_sp <- as(neighborhoods_crs, "Spatial")
# Compute the mean of canopy values by neighborhood
canopy_neighborhoods <- extract(canopy_small, neighborhoods_sp, fun = mean)
# Add the mean canopy values to neighborhoods
neighborhoods_avg_canopy <- mutate(neighborhoods_counts, avg_canopy = as.vector(canopy_neighborhoods))
# Create a histogram of tree density (tree_density)
ggplot(neighborhoods_avg_canopy, aes(x = tree_density)) +
geom_histogram(color = "white")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
# Create a histogram of average canopy (avg_canopy)
ggplot(neighborhoods_avg_canopy, aes(x = avg_canopy)) +
geom_histogram(color = "white")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
# Create a scatter plot of tree_density vs avg_canopy
ggplot(neighborhoods_avg_canopy, aes(x = tree_density, y = avg_canopy)) +
geom_point() +
stat_smooth()
## `geom_smooth()` using method = 'loess'
# Compute the correlation between density and canopy
cor(neighborhoods_avg_canopy$tree_density, neighborhoods_avg_canopy$avg_canopy)
## [1] -0.08667411
# The geom_sf() function operates like any other layer in ggplot2 where you can link variables to aesthetics on the plot through the aes() function
# In a mapping context this might mean, for example, creating a choropleth map by color coding the polygons based on a variable
# If you leave off the aesthetic mapping geom_sf() will map the geometry alone
# Note: geom_sf() is still in the development version of ggplot2 on GitHub. If you want to use geom_sf() on your machine, you need to install the dev version
# devtools::install_github("tidyverse/ggplot2")
# Plot the tree density with default colors
# ggplot(neighborhoods_avg_canopy) +
# geom_sf(aes(fill = tree_density))
# Plot the tree canopy with default colors
# ggplot(neighborhoods) +
# geom_sf(aes(fill = avg_canopy))
# Plot the tree density using scale_fill_gradient()
# ggplot(neighborhoods) +
# geom_sf(aes(fill = tree_density)) +
# scale_fill_gradient(low = "#edf8e9", high = "#005a32")
# Plot the tree canopy using the scale_fill_gradient()
# ggplot(neighborhoods) +
# geom_sf(aes(fill = avg_canopy)) +
# scale_fill_gradient(low = "#edf8e9", high = "#005a32")
# Create a simple map of neighborhoods
library(tmap)
tm_shape(neighborhoods_avg_canopy) +
tm_polygons()
# Create a color-coded map of neighborhood tree density
tm_shape(neighborhoods_avg_canopy) +
tm_polygons(col="tree_density")
# Style the tree density map
tm_shape(neighborhoods_avg_canopy) +
tm_polygons("tree_density", palette = "Greens",
style = "quantile", n = 7,
title = "Trees per sq. KM")
# Create a similar map of average tree canopy
tm_shape(neighborhoods_avg_canopy) +
tm_polygons("avg_canopy", palette = "Greens",
style = "quantile", n = 7,
title = "Average tree canopy (%)")
# Create a map of the manhattan aerial photo
tm_shape(manhattan) +
tm_rgb()
# Create a map of the neighborhood polygons
tm_shape(neighborhoods_avg_canopy) +
tm_borders(col = "black", lwd = 0.5, alpha = 0.5)
# Combine the aerial photo and neighborhoods into one map
map1 <- tm_shape(manhattan) +
tm_rgb() +
tm_shape(neighborhoods_avg_canopy) +
tm_borders(col = "black", lwd = 0.5, alpha = 0.5)
# Create the second map of tree measures (bbox causing errors . . . )
# map2 <- tm_shape(neighborhoods_avg_canopy, bbox = bbox(manhattan)) +
map2 <- tm_shape(neighborhoods_avg_canopy) +
tm_polygons(c("tree_density", "avg_canopy"),
style = "quantile",
palette = "Greens",
title = c("Tree Density", "Average Tree Canopy"))
# Combine the two maps into one
tmap_arrange(map1, map2, asp = NA)
# raster masks dplyr::select
detach("package:raster")
Chapter 1 - Fast and Dirty - Polarity Scoring
Sentiment Analysis and Feelings:
Zipf’s Law, Number of Words, Subjectivity Lexicon:
Explore qdap - Polarity and Lexicon:
Example code includes:
# We created text_df representing a conversation with person and text columns
# Use qdap's polarity() function to score text_df
# polarity() will accept a single character object or data frame with a grouping variable to calculate a positive or negative score
# In this example you will use the magrittr package's dollar pipe operator %$%
# The dollar sign forwards the data frame into polarity() and you declare a text column name or the text column and a grouping variable without quotes
# text_data_frame %$% polarity(text_column_name)
# To create an object with the dollar sign operator:
# polarity_object <- text_data_frame %$%
# polarity(text_column_name, grouping_column_name)
# More specifically, to make a quantitative judgement about the sentiment of some text, you need to give it a score
# A simple method is a positive or negative value related to a sentence, passage or a collection of documents called a corpus
# Scoring with positive or negative values only is called "polarity."
# A useful function for extracting polarity scores is counts() applied to the polarity object
# For a quick visual call plot() on the polarity() outcome
# From http://magrittr.tidyverse.org/
# Many functions accept a data argument, e.g. lm and aggregate, which is very useful in a pipeline where data is first processed and then passed into such a function
# There are also functions that do not have a data argument, for which it is useful to expose the variables in the data
# This is done with the %$% operator
# iris %>%
# subset(Sepal.Length > mean(Sepal.Length)) %$%
# cor(Sepal.Length, Sepal.Width)
library(magrittr)
library(qdap)
## Loading required package: qdapDictionaries
## Loading required package: qdapRegex
##
## Attaching package: 'qdapRegex'
## The following object is masked from 'package:ggplot2':
##
## %+%
## The following object is masked from 'package:dplyr':
##
## explain
## Loading required package: qdapTools
##
## Attaching package: 'qdapTools'
## The following object is masked from 'package:spatstat':
##
## shift
## The following object is masked from 'package:dplyr':
##
## id
## Loading required package: RColorBrewer
##
## Attaching package: 'qdap'
## The following object is masked from 'package:magrittr':
##
## %>%
## The following object is masked from 'package:sf':
##
## %>%
## The following object is masked from 'package:dplyr':
##
## %>%
## The following object is masked from 'package:base':
##
## Filter
text_df <- data.frame(
person=c('Nick', 'Jonathan', 'Martijn', 'Nicole', 'Nick', 'Jonathan', 'Martijn', 'Nicole'),
text=c('DataCamp courses are the best', 'I like talking to students', 'Other online data science curricula are boring.', 'What is for lunch?', 'DataCamp has lots of great content!', 'Students are passionate and are excited to learn', 'Other data science curriculum is hard to learn and difficult to understand', 'I think the food here is good.'),
stringsAsFactors=TRUE
)
# Examine the text data
text_df
## person
## 1 Nick
## 2 Jonathan
## 3 Martijn
## 4 Nicole
## 5 Nick
## 6 Jonathan
## 7 Martijn
## 8 Nicole
## text
## 1 DataCamp courses are the best
## 2 I like talking to students
## 3 Other online data science curricula are boring.
## 4 What is for lunch?
## 5 DataCamp has lots of great content!
## 6 Students are passionate and are excited to learn
## 7 Other data science curriculum is hard to learn and difficult to understand
## 8 I think the food here is good.
# Calc overall polarity score
text_df %$% qdap::polarity(text)
## all total.sentences total.words ave.polarity sd.polarity stan.mean.polarity
## 1 all 8 54 0.179 0.452 0.396
# Calc polarity score by person
(datacamp_conversation <- text_df %$% qdap::polarity(text, person))
## person total.sentences total.words ave.polarity sd.polarity stan.mean.polarity
## 1 Jonathan 2 13 0.577 0.184 3.141
## 2 Martijn 2 19 -0.478 0.141 -3.388
## 3 Nick 2 11 0.428 0.028 15.524
## 4 Nicole 2 11 0.189 0.267 0.707
# Counts table from datacamp_conversation
qdap::counts(datacamp_conversation)
## person wc polarity pos.words neg.words text.var
## 1 Nick 5 0.447 best - DataCamp courses are the best
## 2 Jonathan 5 0.447 like - I like talking to students
## 3 Martijn 7 -0.378 - boring Other online data science curricula are boring.
## 4 Nicole 4 0.000 - - What is for lunch?
## 5 Nick 6 0.408 great - DataCamp has lots of great content!
## 6 Jonathan 8 0.707 passionate, excited - Students are passionate and are excited to learn
## 7 Martijn 12 -0.577 - hard, difficult Other data science curriculum is hard to learn and difficult to understand
## 8 Nicole 7 0.378 good - I think the food here is good.
# Plot the conversation polarity
plot(datacamp_conversation)
## Warning: `show_guide` has been deprecated. Please use `show.legend`
## instead.
## Warning: `show_guide` has been deprecated. Please use `show.legend`
## instead.
# In the Text Mining: Bag of Words course you learned that a corpus is a set of texts, and you studied some functions for preprocessing the text
# To recap, one way to create a corpus is with the functions below
# Even though this is a different course, sentiment analysis is part of text mining so a refresher can be helpful
# Turn a character vector into a text source using VectorSource().
# Turn a text source into a corpus using VCorpus().
# Remove unwanted characters from the corpus using cleaning functions like removePunctuation() and stripWhitespace() from tm, and replace_abbreviation() from qdap
# In this exercise a custom clean_corpus() function has been created using standard preprocessing functions for easier application
# clean_corpus() accepts the output of VCorpus() and applies cleaning functions. For example:
# processed_corpus <- clean_corpus(my_corpus)
library(tm)
## Loading required package: NLP
##
## Attaching package: 'NLP'
## The following object is masked from 'package:qdap':
##
## ngrams
## The following object is masked from 'package:ggplot2':
##
## annotate
##
## Attaching package: 'tm'
## The following objects are masked from 'package:qdap':
##
## as.DocumentTermMatrix, as.TermDocumentMatrix
clean_corpus <- function(corpus){
corpus <- tm_map(corpus, content_transformer(replace_abbreviation))
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, removeNumbers)
corpus <- tm_map(corpus, removeWords, c(stopwords("en"), "coffee"))
corpus <- tm_map(corpus, content_transformer(tolower))
corpus <- tm_map(corpus, stripWhitespace)
return(corpus)
}
# Your R session has a text vector, tm_define, containing two small documents and the function clean_corpus().
tm_define <- c("Text mining is the process of distilling actionable insights from text.", "Sentiment analysis represents the set of tools to extract an author's feelings towards a subject.")
# clean_corpus(), tm_define are pre-defined
clean_corpus
## function(corpus){
## corpus <- tm_map(corpus, content_transformer(replace_abbreviation))
## corpus <- tm_map(corpus, removePunctuation)
## corpus <- tm_map(corpus, removeNumbers)
## corpus <- tm_map(corpus, removeWords, c(stopwords("en"), "coffee"))
## corpus <- tm_map(corpus, content_transformer(tolower))
## corpus <- tm_map(corpus, stripWhitespace)
## return(corpus)
## }
tm_define
## [1] "Text mining is the process of distilling actionable insights from text."
## [2] "Sentiment analysis represents the set of tools to extract an author's feelings towards a subject."
# Create a VectorSource
tm_vector <- VectorSource(tm_define)
# Apply VCorpus
tm_corpus <- VCorpus(tm_vector)
# Examine the first document's contents
content(tm_corpus[[1]])
## [1] "Text mining is the process of distilling actionable insights from text."
# Clean the text
tm_clean <- clean_corpus(tm_corpus)
# Reexamine the contents of the first doc
content(tm_clean[[1]])
## [1] "text mining process distilling actionable insights text"
# Now let's create a Document Term Matrix (DTM). In a DTM
# Each row of the matrix represents a document.
# Each column is a unique word token.
# Values of the matrix correspond to an individual document's word usage
# The DTM is the basis for many bag of words analyses
# Later in the course, you will also use the related Term Document Matrix (TDM)
# This is the transpose; that is, columns represent documents and rows represent unique word tokens
# You should construct a DTM after cleaning the corpus (using clean_corpus())
# To do so, call DocumentTermMatrix() on the corpus object
# tm_dtm <- DocumentTermMatrix(tm_clean)
# If you need a more in-depth refresher check out the Text Mining: Bag of Words course
# Hopefully these two exercises have prepared you well enough to embark on your sentiment analysis journey!
# We've created a VCorpus() object called clean_text containing 1000 tweets mentioning coffee
# The tweets have been cleaned with the previously mentioned preprocessing steps and your goal is to create a DTM from it
# clean_text is pre-defined (do not have VCorpus)
# clean_text
# Create tf_dtm
# tf_dtm <- DocumentTermMatrix(clean_text)
# Create tf_dtm_m
# tf_dtm_m <- as.matrix(tf_dtm)
# Dimensions of DTM matrix
# dim(tf_dtm_m)
# Subset part of tf_dtm_m for comparison
# tf_dtm_m[16:20, 2975:2985]
# Although Zipf observed a steep and predictable decline in word usage you may not buy into Zipf's law
# You may be thinking "I know plenty of words, and have a distinctive vocabulary"
# That may be the case, but the same can't be said for most people!
# To prove it, let's construct a visual from 3 million tweets mentioning "#sb"
# Keep in mind that the visual doesn't follow Zipf's law perfectly, the tweets all mentioned the same hashtag so it is a bit skewed
# That said, the visual you will make follows a steep decline showing a small lexical diversity among the millions of tweets
# So there is some science behind using lexicons for natural language analysis!
# In this exercise, you will use the package metricsgraphics
# Although the author suggests using the pipe %>% operator, you will construct the graphic step-by-step to learn about the various aspects of the plot
# The main function of the package metricsgraphics is the mjs_plot() function which is the first step in creating a JavaScript plot
# Once you have that, you can add other layers on top of the plot
# An example metricsgraphics workflow without using the %>% operator is below
# metro_plot <- mjs_plot(data, x = x_axis_name, y = y_axis_name, show_rollover_text = FALSE)
# metro_plot <- mjs_line(metro_plot)
# metro_plot <- mjs_add_line(metro_plot, line_one_values)
# metro_plot <- mjs_add_legend(metro_plot, legend = c('names', 'more_names'))
# metro_plot
rawWords <- c('sb', 'rt', 'the', 'to', 'a', 'for', 'esurancesweepstakes', 'you', 'broncos', 'esurance', 'in', 'is', 'of', 'on', 'win', 'and', 'panthers', 'nfl', 'i', 'super', 'at', 'with', 'bowl', 'this', 'your', 'superbowl', 'it', 'are', 'keeppounding', 'that', 'be', 'will', 'k', 'game', 'amp', 'we', 'our', 'my', 'got', 'cam', 'https\205', 'big', 'if', 'but', 'from', 'just', 'time', 'now', 'all', 'have', 'up', 'who', 'out', 'show', 'sbfanvote', 'peyton', 'so', 'chance', 'was', 'why', 'watch', 'see', 'like', 'winning', 'not', 'commercial', 'get', 'by', 'coldplay', 'more', 'think', 'what', 'go', 'one', 'do', 'over', 'here', 'halftime', 'away', 'good', 'me', 'gaga', 'lady', 'i\222ve', 'how', 'httpstco\205', 'ready', 'manning', 'pepsihalftime', 'could', 'wearing', 'ad', 'during', 'its', 'about', 'beyonce', 'doritos', 'httpst\205', 'an', 'day', 'going', 'anthem', 'after', 'national', 'than', 'team', 'want', 'gonna', 'some', 'his', 'denver', 'best', 'ladygaga', 'can', 'im', 'pass', 'today', 'enter', 'socks', 'avosinspace', 'shoes', 'sandals', 'reporter', 'jeans', 'biggame', 'httpstcoqdraydnsb', 'brunomars', 'tomorrow', 'sweepstakes', 'check', 'when', 'avosfrommexico', 'beyonc\351', 'sunday', 'great', 'seo', 'mvp', 'performance', 'as', 'love', 'they', 'new', 'field', 'did', 'congrats', 'tmobile', 'still', 'no', 'drake', 'tonight', 'special', 'yougotcarriered', 'he', 'last', 'has', 'too', 'superbowlsunday', 'lets', 'make')
rawFreq <- c(1984423, 1700564, 1101899, 588803, 428598, 388390, 326464, 322154, 296673, 292468, 266847, 265392, 245718, 234509, 233618, 233157, 215919, 212620, 202765, 183808, 182673, 176209, 175996, 172636, 146487, 143345, 142812, 136649, 134436, 130056, 128878, 126930, 116187, 115213, 114805, 108680, 103023, 88247, 88099, 84442, 82291, 82116, 79843, 78986, 77616, 77562, 75405, 73245, 70581, 68565, 68325, 66217, 66030, 64489, 63026, 62986, 62878, 61111, 60982, 59658, 59629, 57424, 56911, 56585, 56455, 56182, 55496, 55237, 54729, 53962, 52840, 50489, 46303, 46216, 45832, 45569, 44364, 43338, 42667, 42008, 41743, 41566, 41473, 40003, 39888, 39808, 39421, 38575, 38498, 37085, 35345, 32997, 31292, 31018, 30832, 29258, 29183, 28980, 28908, 27361, 27283, 23367, 23183, 22575, 22456, 21964, 21095, 20530, 20213, 19514, 19428, 19115, 18887, 18483, 18120, 16901, 14239, 14110, 13475, 13424, 13329, 13326, 13304, 13231, 13221, 13194, 12641, 12225, 11635, 11502, 11362, 11341, 11293, 11102, 10986, 10660, 10637, 10331, 10136, 10040, 9963, 9745, 9616, 9495, 9468, 9397, 9384, 9368, 9284, 8914, 8732, 8719, 8697, 8629, 8536, 8379, 8316, 7977, 7970)
rawRank <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99, 100, 101, 102, 103, 104, 105, 106, 107, 108, 109, 110, 111, 112, 113, 114, 115, 116, 117, 118, 119, 120, 121, 122, 123, 124, 125, 126, 127, 128, 129, 130, 131, 132, 133, 134, 135, 136, 137, 138, 139, 140, 141, 142, 143, 144, 145, 146, 147, 148, 149, 150, 151, 152, 153, 154, 155, 156, 157, 158, 159)
sb_words <- data.frame(word=rawWords, freq=rawFreq, rank=rawRank)
# Examine sb_words
head(sb_words)
## word freq rank
## 1 sb 1984423 1
## 2 rt 1700564 2
## 3 the 1101899 3
## 4 to 588803 4
## 5 a 428598 5
## 6 for 388390 6
# Create expectations
# sb_words$expectations <- sb_words %$%
# {freq / rank}
# Probably should be something more like this
sb_words$expectations <- sb_words %$%
{freq[1] / rank}
# Create metrics plot
sb_plot <- metricsgraphics::mjs_plot(sb_words, x = rank, y = freq, show_rollover_text = FALSE)
# Add 1st line
sb_plot <- metricsgraphics::mjs_line(sb_plot)
# Add 2nd line
sb_plot <- metricsgraphics::mjs_add_line(sb_plot, expectations)
# Add legend
sb_plot <- metricsgraphics::mjs_add_legend(sb_plot, legend = c("Frequency", "Expectation"))
# Display plot
sb_plot
# So far you have learned the basic components needed for assessing positive or negative intent in text
# Remember the following points so you can feel confident in your results.
# The subjectivity lexicon is a predefined list of words associated with emotions or positive/negative feelings.
# You don't have to list every word in a subjectivity lexicon because Zipf's law describes human expression.
# A quick way to get started is to use the polarity() function which has a built-in subjectivity lexicon
# The function scans the text to identify words in the lexicon
# It then creates a word group around the identified positive or negative subjectivity word
# Within the group valence shifters adjust the score
# Valence shifters are words that amplify or negate the emotional intent of the subjectivity word
# For example, "well known" is positive while "not well known" is negative
# Here "not" is a negating term and reverses the emotional intent of "well known."
# In contrast, "very well known" employs an amplifier increasing the positive intent
# The polarity() function then calculates a score using subjectivity terms, valence shifters and the total number of words in the passage
# This exercise demonstrates a simple polarity calculation
# In the next video we look under the hood of polarity() for more detail
# We've defined positive to denote a positive statement
# Example statements
positive <- "DataCamp courses are good for learning"
# Calculate polarity of both statements
(pos_score <- polarity(positive))
## all total.sentences total.words ave.polarity sd.polarity stan.mean.polarity
## 1 all 1 6 0.408 NA NA
# Get counts
(pos_counts <- counts(pos_score))
## all wc polarity pos.words neg.words text.var
## 1 all 6 0.408 good - DataCamp courses are good for learning
# Number of positive words
n_good <- length(pos_counts$pos.words[[1]])
# Total number of words
n_words <- pos_counts$wc
# Verify polarity score
n_good / sqrt(n_words)
## [1] 0.4082483
# Of course just positive and negative words aren't enough
# In this exercise you will learn about valence shifters which tell you about the author's emotional intent
# Previously you applied polarity() to text without valence shifters. In this example you will see amplifers and negating words in action
# Recall that an amplifying word adds 0.8 to a positive word in polarity() so the positive score becomes 1.8
# For negative words 0.8 is subtracted so the total becomes -1.8
# Then the score is divided by the square root of the total number of words
# Consider the following example from Frank Sinatra: "It was a very good year"
# "Good" equals 1 and "very" adds another 0.8
# So, 1.8/sqrt(6) results in 0.73 polarity
# A negating word such as "not" will inverse the subjectivity score
# Consider the following example from Bobby McFerrin: "Don't worry Be Happy"
# "worry is now 1 due to the negation "don't."
# Adding the "happy", +1, equals 2
# With 4 total words, 2 / sqrt(4) equals a polarity score of 1
conversation <- data.frame(student=c('Martijn', 'Nick', 'Nicole'),
text=c('This restaurant is never bad', 'The lunch was very good', 'It was awful I got food poisoning and was extremely ill'),
stringsAsFactors=TRUE
)
# Examine conversation
conversation
## student text
## 1 Martijn This restaurant is never bad
## 2 Nick The lunch was very good
## 3 Nicole It was awful I got food poisoning and was extremely ill
# Polarity - All
polarity(conversation$text)
## all total.sentences total.words ave.polarity sd.polarity stan.mean.polarity
## 1 all 3 21 0.317 0.565 0.561
# Polarity - Grouped
student_pol <- conversation %$%
polarity(text, student)
# Student results
scores(student_pol)
## student total.sentences total.words ave.polarity sd.polarity stan.mean.polarity
## 1 Martijn 1 5 0.447 NA NA
## 2 Nick 1 5 0.805 NA NA
## 3 Nicole 1 11 -0.302 NA NA
# Sentence by sentence
counts(student_pol)
## student wc polarity pos.words neg.words text.var
## 1 Martijn 5 0.447 - bad This restaurant is never bad
## 2 Nick 5 0.805 good - The lunch was very good
## 3 Nicole 11 -0.302 - awful It was awful I got food poisoning and was extremely ill
# qdap plot
plot(student_pol)
## Warning: `show_guide` has been deprecated. Please use `show.legend`
## instead.
## Warning: `show_guide` has been deprecated. Please use `show.legend`
## instead.
## Warning: Removed 3 rows containing missing values (geom_errorbarh).
# Even with Zipf's law in action, you will still need to adjust lexicons to fit the text source (for example twitter versus legal documents) or the author's demographics (teenage girl versus middle aged man)
# This exercise demonstrates the explicit components of polarity() so you can change it if needed
# In Trey Songz "Lol :)" song there is a lyric "LOL smiley face, LOL smiley face."
# In the basic polarity() function, "LOL" is not defined as positive
# However, "LOL" stands for "Laugh Out Loud" and should be positive
# As a result, you should adjust the lexicon to fit the text's context which includes pop-culture slang
# If your analysis contains text from a specific channel (Twitter's "LOL"), location (Boston's "Wicked Good"), or age group (teenagers "sick") you will likely have to adjust the lexicon
# In this exercise you are not adjusting the subjectivity lexicon or qdap dictionaries containing valence shifters
# Instead you are examining the existing word data frame objects so you can change them in the following exercise
# We've created text containing two excerpts from Beyoncé's "Crazy in Love" lyrics for the exercise
# Examine the key.pol
key.pol
## x y
## 1: a plus 1
## 2: abnormal -1
## 3: abolish -1
## 4: abominable -1
## 5: abominably -1
## ---
## 6775: zealously -1
## 6776: zenith 1
## 6777: zest 1
## 6778: zippy 1
## 6779: zombie -1
# Negators
negation.words
## [1] "ain't" "aren't" "can't" "couldn't" "didn't"
## [6] "doesn't" "don't" "hasn't" "isn't" "mightn't"
## [11] "mustn't" "neither" "never" "no" "nobody"
## [16] "nor" "not" "shan't" "shouldn't" "wasn't"
## [21] "weren't" "won't" "wouldn't"
# Amplifiers
amplification.words
## [1] "acute" "acutely" "certain" "certainly"
## [5] "colossal" "colossally" "deep" "deeply"
## [9] "definite" "definitely" "enormous" "enormously"
## [13] "extreme" "extremely" "great" "greatly"
## [17] "heavily" "heavy" "high" "highly"
## [21] "huge" "hugely" "immense" "immensely"
## [25] "incalculable" "incalculably" "massive" "massively"
## [29] "more" "particular" "particularly" "purpose"
## [33] "purposely" "quite" "real" "really"
## [37] "serious" "seriously" "severe" "severely"
## [41] "significant" "significantly" "sure" "surely"
## [45] "true" "truly" "vast" "vastly"
## [49] "very"
# De-amplifiers
deamplification.words
## [1] "barely" "faintly" "few" "hardly"
## [5] "little" "only" "rarely" "seldom"
## [9] "slightly" "sparesly" "sporadically" "very few"
## [13] "very little"
text <- data.frame(speaker=c("beyonce", "jay_z"),
words=c("I know I dont understand Just how your love can do what no one else can", "They cant figure him out they like hey, is he insane"),
stringsAsFactors=TRUE
)
# Examine
text
## speaker
## 1 beyonce
## 2 jay_z
## words
## 1 I know I dont understand Just how your love can do what no one else can
## 2 They cant figure him out they like hey, is he insane
# Explicit polarity parameters
polarity(
text.var = text$words,
grouping.var = text$speaker,
polarity.frame = key.pol,
negators = negation.words,
amplifiers = amplification.words,
deamplifiers = deamplification.words
)
## speaker total.sentences total.words ave.polarity sd.polarity stan.mean.polarity
## 1 beyonce 1 16 0.25 NA NA
## 2 jay_z 1 11 0.00 NA NA
# Here you will adjust the negative words to account for the specific text. You will then compare the basic and custom polarity() scores
# A popular song from Twenty One Pilots is called "Stressed Out".
# If you scan the lyrics of this song, you will observe the song is about youthful nostalgia
# Overall, most people would say the polarity is negative
# Repeatedly the lyrics mention stress, fears and pretending
# Let's compare the song lyrics using the default subjectivity lexicon and also a custom one
# To start, you need to verify the key.pol subjectivity lexicon does not already have the term you want to add
# One way to check is with grep()
# The grep() function returns the row containing characters that match a search pattern
# data_frame[grep("search_pattern", data_frame$column), ]
# After verifying the slang or new word is not already in the key.pol lexicon you need to add it
# The code below uses sentiment_frame() to construct the new lexicon
# Within the code sentiment_frame() accepts the original positive word vector, positive.words
# Next, the original negative.words are concatenated to "smh" and "kappa", both considered negative slang
# Although you can declare the positive and negative weights, the default is 1 and -1 so they are not included below
# custom_pol <- sentiment_frame(positive.words, c(negative.words, "hate", "pain"))
stressed_out <- "I wish I found some better sounds no ones ever heard\nI wish I had a better voice that sang some better words\nI wish I found some chords in an order that is new\nI wish I didnt have to rhyme every time I sang\nI was told when I get older all my fears would shrink\nBut now Im insecure and I care what people think\nMy names Blurryface and I care what you think\nMy names Blurryface and I care what you think\nWish we could turn back time, to the good old days\nWhen our momma sang us to sleep but now were stressed out\nWish we could turn back time to the good old days\nWhen our momma sang us to sleep but now were stressed out\nWere stressed out\nSometimes a certain smell will take me back to when I was young\nHow come Im never able to identify where its coming from\nId make a candle out of it if I ever found it\nTry to sell it never sell out of it Id probably only sell one\nItd be to my brother, cause we have the same nose\nSame clothes homegrown a stones throw from a creek we used to roam\nBut it would remind us of when nothing really mattered\nOut of student loans and tree-house homes we all would take the latter\nMy names Blurryface and I care what you think\nMy names Blurryface and I care what you think\nWish we could turn back time, to the good old days\nWhen our momma sang us to sleep but now were stressed out\nWish we could turn back time, to the good old days\nWhen our momma sang us to sleep but now were stressed out\nWe used to play pretend, give each other different names\nWe would build a rocket ship and then wed fly it far away\nUsed to dream of outer space but now theyre laughing at our face #\nSaying, Wake up you need to make money\nYeah\nWe used to play pretend give each other different names\nWe would build a rocket ship and then wed fly it far away\nUsed to dream of outer space but now theyre laughing at our face\nSaying, Wake up, you need to make money\nYeah\nWish we could turn back time, to the good old days\nWhen our momma sang us to sleep but now were stressed out\nWish we could turn back time, to the good old days\nWhen our momma sang us to sleep but now were stressed out\nUsed to play pretend, used to play pretend bunny\nWe used to play pretend wake up, you need the money\nUsed to play pretend used to play pretend bunny\nWe used to play pretend, wake up, you need the money\nWe used to play pretend give each other different names\nWe would build a rocket ship and then wed fly it far away\nUsed to dream of outer space but now theyre laughing at our face\nSaying, Wake up, you need to make money\nYeah"
# stressed_out has been pre-defined
head(stressed_out)
## [1] "I wish I found some better sounds no ones ever heard\nI wish I had a better voice that sang some better words\nI wish I found some chords in an order that is new\nI wish I didnt have to rhyme every time I sang\nI was told when I get older all my fears would shrink\nBut now Im insecure and I care what people think\nMy names Blurryface and I care what you think\nMy names Blurryface and I care what you think\nWish we could turn back time, to the good old days\nWhen our momma sang us to sleep but now were stressed out\nWish we could turn back time to the good old days\nWhen our momma sang us to sleep but now were stressed out\nWere stressed out\nSometimes a certain smell will take me back to when I was young\nHow come Im never able to identify where its coming from\nId make a candle out of it if I ever found it\nTry to sell it never sell out of it Id probably only sell one\nItd be to my brother, cause we have the same nose\nSame clothes homegrown a stones throw from a creek we used to roam\nBut it would remind us of when nothing really mattered\nOut of student loans and tree-house homes we all would take the latter\nMy names Blurryface and I care what you think\nMy names Blurryface and I care what you think\nWish we could turn back time, to the good old days\nWhen our momma sang us to sleep but now were stressed out\nWish we could turn back time, to the good old days\nWhen our momma sang us to sleep but now were stressed out\nWe used to play pretend, give each other different names\nWe would build a rocket ship and then wed fly it far away\nUsed to dream of outer space but now theyre laughing at our face #\nSaying, Wake up you need to make money\nYeah\nWe used to play pretend give each other different names\nWe would build a rocket ship and then wed fly it far away\nUsed to dream of outer space but now theyre laughing at our face\nSaying, Wake up, you need to make money\nYeah\nWish we could turn back time, to the good old days\nWhen our momma sang us to sleep but now were stressed out\nWish we could turn back time, to the good old days\nWhen our momma sang us to sleep but now were stressed out\nUsed to play pretend, used to play pretend bunny\nWe used to play pretend wake up, you need the money\nUsed to play pretend used to play pretend bunny\nWe used to play pretend, wake up, you need the money\nWe used to play pretend give each other different names\nWe would build a rocket ship and then wed fly it far away\nUsed to dream of outer space but now theyre laughing at our face\nSaying, Wake up, you need to make money\nYeah"
# Basic lexicon score
polarity(stressed_out)
## all total.sentences total.words ave.polarity sd.polarity stan.mean.polarity
## 1 all 1 518 -0.255 NA NA
# Check the subjectivity lexicon
key.pol[grep("stress", x)]
## x y
## 1: distress -1
## 2: distressed -1
## 3: distressing -1
## 4: distressingly -1
## 5: mistress -1
## 6: stress -1
## 7: stresses -1
## 8: stressful -1
## 9: stressfully -1
# New lexicon
custom_pol <- sentiment_frame(positive.words, c(negative.words, "stressed", "turn back"))
# Compare new score
polarity(stressed_out, polarity.frame = custom_pol)
## all total.sentences total.words ave.polarity sd.polarity stan.mean.polarity
## 1 all 1 518 -0.826 NA NA
Chapter 2 - Sentiment Analysis the tidytext Way
Plutchik’s wheel of emotion, polarity vs. sentiment:
Bing lexicon with inner join:
AFINN and NRC methodologies in more detail:
Example code includes:
# There is a growing number of "tidyverse" R packages
# The tidyverse is a collection of R packages that share common philosophies and are designed to work together
# This chapter covers some tidy functions to manipulate data
# In fact, in this exercise you will compare a DTM to a tidy text data frame called a tibble
# Within the tidyverse each observation is a single row in a data frame
# That makes working in different packages much easier since the fundamental data structure is the same
# Parts of this course borrow heavily from the tidytext package which uses this data organization
# To change a DTM to a tidy format use tidy() from the broom package.
# tidy_format <- tidy(Document_Term_Matrix)
# This exercise uses text from the Greek tragedy, Agamemnon
# Agamemnon is a story about marital infidelity and murder
# You can download a copy here (http://www.gutenberg.org/ebooks/14417?msg=welcome_stranger)
# We've already created a clean DTM called ag_dtm for this exercise.
clean_corpus <- function(corpus){
corpus <- tm_map(corpus, content_transformer(replace_abbreviation))
corpus <- tm_map(corpus, stripWhitespace)
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, removeNumbers)
corpus <- tm_map(corpus, content_transformer(tolower))
corpus <- tm_map(corpus, removeWords, c(stopwords("en")))
return(corpus)
}
agRawText <- readLines("./RInputFiles/pg14417.txt")
agSource <- VectorSource(agRawText[318:3430])
agCorpus <- VCorpus(agSource)
agClean <- clean_corpus(agCorpus)
ag_dtm <- DocumentTermMatrix(agClean)
# As matrix
ag_dtm_m <- as.matrix(ag_dtm)
# Examine line 2206 and columns 245:250 (edited to 2206 and 308:313)
ag_dtm_m[2206, 308:313]
## bleed bleeds blent bless blessã¨d blessing
## 0 0 0 1 0 0
# Tidy up the DTM (function does not work here . . . )
# ag_tidy <- broom::tidy(ag_dtm)
ag_tidy <- tibble::tibble(document=ag_dtm$dimnames$Docs[ag_dtm$i],
term=ag_dtm$dimnames$Terms[ag_dtm$j],
count=ag_dtm$v
)
# Examine tidy with a word you saw
ag_tidy[824:828, ]
## # A tibble: 5 x 3
## document term count
## <chr> <chr> <dbl>
## 1 234 bleeds 1.00
## 2 234 sleepeth 1.00
## 3 235 comes 1.00
## 4 235 will 1.00
## 5 235 wisdom 1.00
# So far you have used a single lexicon
# Now we will transition to using three, each measuring sentiment in different ways
# The tidytext package contains a data frame called sentiments
# The data frame contains over 23000 terms from three different subjectivity lexicons with corresponding information
# Here are some example rows from the sentiments data frame
# Notice the tidy format
# Each word is a row and NAs fill in columns that are not applicable
# The "AFINN" lexicon scores words from 5 to -5
# The "Bing" lexicon is the same lexicon used in qdap's polarity() function
# "Bing" words are only labeled as positive or negative
# The "NRC" lexicon has distinct emotional classes covering Plutchik's Wheel and positive and negative
# Subset to AFINN
afinn_lex <- tidytext::get_sentiments("afinn")
# Count AFINN scores
afinn_lex %>%
count(score)
## # A tibble: 11 x 2
## score n
## <int> <int>
## 1 -5 16
## 2 -4 43
## 3 -3 264
## 4 -2 965
## 5 -1 309
## 6 0 1
## 7 1 208
## 8 2 448
## 9 3 172
## 10 4 45
## 11 5 5
# Subset to nrc
nrc_lex <- tidytext::get_sentiments("nrc")
# Print nrc_lex
nrc_lex
## # A tibble: 13,901 x 2
## word sentiment
## <chr> <chr>
## 1 abacus trust
## 2 abandon fear
## 3 abandon negative
## 4 abandon sadness
## 5 abandoned anger
## 6 abandoned fear
## 7 abandoned negative
## 8 abandoned sadness
## 9 abandonment anger
## 10 abandonment fear
## # ... with 13,891 more rows
# Make the nrc counts object
nrc_counts <- nrc_lex %>%
count(sentiment)
# Barplot
ggplot(nrc_counts, aes(x = sentiment, y = n))+
geom_bar(stat = "identity") +
ggthemes::theme_gdocs()
# The Bing lexicon labels words as positive or negative
# The next three exercises let you interact with this specific lexicon
# Instead of using filter() to extract a lexicon this exercise uses get_sentiments() which accepts a string such as "afinn", "bing", "nrc", or "loughran"
# Now that you understand the basics of an inner join, let's apply this to the "Bing" lexicon
# Keep in mind the inner_join() function comes from dplyr and the sentiments object is from tidytext
# The inner join workflow:
# Obtain the correct lexicon using either filter() or get_sentiments().
# Pass the lexicon and the tidy text data to inner_join().
# In order for inner_join() to work there must be a shared column name. If there are no shared column names, declare them with an additional parameter, by equal to c with column names like below
# object <- x %>%
# inner_join(y, by = c("column_from_x" = "column_from_y")
# We've loaded ag_txt containing the first 100 lines from Agamemnon and ag_tidy which is the tidy version
ag_txt <- agRawText[agRawText != ""][1:100]
# Qdap polarity
polarity(ag_txt)
## Warning in polarity(ag_txt):
## Some rows contain double punctuation. Suggested use of `sentSplit` function.
## all total.sentences total.words ave.polarity sd.polarity stan.mean.polarity
## 1 all 100 1038 -0.093 0.364 -0.255
# Get Bing lexicon
bing <- tidytext::get_sentiments("bing")
# Join text to lexicon
ag_bing_words <- inner_join(ag_tidy, bing, by = c("term" = "word"))
# Examine
ag_bing_words
## # A tibble: 1,641 x 4
## document term count sentiment
## <chr> <chr> <dbl> <chr>
## 1 10 waste 1.00 negative
## 2 11 respite 1.00 positive
## 3 13 well 1.00 positive
## 4 14 lonely 1.00 negative
## 5 16 great 1.00 positive
## 6 16 heavenly 1.00 positive
## 7 22 dark 1.00 negative
## 8 23 fear 1.00 negative
## 9 24 warning 1.00 negative
## 10 25 well 1.00 positive
## # ... with 1,631 more rows
# Get counts by sentiment
ag_bing_words %>%
count(sentiment)
## # A tibble: 2 x 2
## sentiment n
## <chr> <int>
## 1 negative 1033
## 2 positive 608
# The spread() function spreads a key-value pair across multiple columns
# In this case key is the sentiment and the values are the frequency of positive or negative terms for each line
# Using spread() changes the data so that each row now has positive and negative values, even if it is 0
# In this exercise, your R session has m_dick_tidy which contains the book Moby Dick and bing, containing the lexicon similar to the previous exercise
all_books <- readRDS("./RInputFiles/all_books.rds")
m_dick_tidy <- all_books[all_books$book=="moby_dick", c("term", "document", "count")]
m_dick_tidy
## # A tibble: 109,996 x 3
## term document count
## <chr> <chr> <dbl>
## 1 chapter 2 1.00
## 2 loomings 2 1.00
## 3 agonever 5 1.00
## 4 call 5 1.00
## 5 ishmael 5 1.00
## 6 long 5 1.00
## 7 mind 5 1.00
## 8 preciselyhaving 5 1.00
## 9 some 5 1.00
## 10 years 5 1.00
## # ... with 109,986 more rows
# Inner join
moby_lex_words <- inner_join(m_dick_tidy, bing, by = c("term" = "word"))
moby_lex_words <- moby_lex_words %>%
# Set index to numeric document
mutate(index = as.numeric(document))
moby_count <- moby_lex_words %>%
# Count by sentiment, index
count(sentiment, index)
# Examine the counts
moby_count
## # A tibble: 10,594 x 3
## sentiment index n
## <chr> <dbl> <int>
## 1 negative 9.00 1
## 2 negative 11.0 1
## 3 negative 22.0 1
## 4 negative 41.0 1
## 5 negative 42.0 2
## 6 negative 44.0 1
## 7 negative 56.0 1
## 8 negative 64.0 1
## 9 negative 66.0 1
## 10 negative 68.0 1
## # ... with 10,584 more rows
moby_spread <- moby_count %>%
# Spread sentiments
tidyr::spread(sentiment, n, fill = 0)
# Review the spread data
moby_spread
## # A tibble: 9,229 x 3
## index negative positive
## <dbl> <dbl> <dbl>
## 1 9.00 1.00 0
## 2 11.0 1.00 0
## 3 13.0 0 1.00
## 4 17.0 0 1.00
## 5 19.0 0 1.00
## 6 22.0 1.00 0
## 7 24.0 0 1.00
## 8 25.0 0 1.00
## 9 31.0 0 2.00
## 10 35.0 0 2.00
## # ... with 9,219 more rows
# The last Bing lexicon exercise!
# We started with this lexicon since its similar to the results in Chapter 1
# In this exercise you will use the pipe operator (%>%) to create a timeline of the sentiment in Moby Dick
# In the end you will also create a simple visual following the code structure below
# The next chapter goes into more depth for visuals
# Your R session has moby as your text and bing as your lexicon
# After this exercise you should know Is Moby Dick a happy or sad book?
moby_polarity <- m_dick_tidy %>%
mutate(index = as.numeric(document)) %>%
# Inner join to lexicon
inner_join(bing, by = c("term" = "word")) %>%
# Count the sentiment scores
count(sentiment, index) %>%
# Spread the sentiment into positive and negative columns
tidyr::spread(sentiment, n, fill = 0) %>%
# Add polarity column
mutate(polarity = positive - negative)
# Plot polarity vs. index
ggplot(moby_polarity, aes(x=index, y=polarity)) +
# Add a smooth trend curve
geom_smooth()
## `geom_smooth()` using method = 'gam'
# Now we transition to the AFINN lexicon
# The AFINN lexicon has numeric values from 5 to -5, not just positive or negative
# Unlike the Bing lexicon's sentiment, the AFINN lexicon's sentiment score column is called score
# As before, you apply inner_join() then count()
# Next, to sum the scores of each line, we use dplyr's group_by() and summarize() functions
# The group_by() function takes an existing data frame and converts it into a grouped data frame where operations are performed "by group"
# Then, the summarize() function lets you calculate a value for each group in your data frame using a function that aggregates data, like sum() or mean()
# So, in our case we can do something like
# data_frame %>%
# group_by(book_line) %>%
# summarize(total_score = sum(book_line))
# In the tidy version of Huckleberry Finn, line 9703 contains words "best", "ever", "fun", "life" and "spirit". "best" and "fun" have AFINN scores of 3 and 4 respectively
# After aggregating, line 9703 will have a total score of 7
# The afinn object contains the AFINN lexicon
# The huck object is a tidy version of Mark Twain's Adventures of Huckleberry Finn for analysis
# Line 5400 is All the loafers looked glad; I reckoned they was used to having fun out of Boggs
# Stopwords and punctuation have already been removed in the dataset
huck <- all_books[all_books$book=="huck_finn", c("term", "document", "count")] %>%
mutate(document=as.numeric(document)) %>%
rename(line=document)
huck
## # A tibble: 55,198 x 3
## term line count
## <chr> <dbl> <dbl>
## 1 finn 1.00 1.00
## 2 ïhuckleberry 1.00 1.00
## 3 ago 3.00 1.00
## 4 fifty 3.00 1.00
## 5 forty 3.00 1.00
## 6 mississippi 3.00 1.00
## 7 scene 3.00 1.00
## 8 the 3.00 1.00
## 9 time 3.00 1.00
## 10 valley 3.00 1.00
## # ... with 55,188 more rows
# See abbreviated line 5400
huck %>% filter(line == 5400)
## # A tibble: 7 x 3
## term line count
## <chr> <dbl> <dbl>
## 1 all 5400 1.00
## 2 fun 5400 1.00
## 3 glad 5400 1.00
## 4 loafers 5400 1.00
## 5 looked 5400 1.00
## 6 reckoned 5400 1.00
## 7 used 5400 1.00
# What are the scores of the sentiment words?
afinn_lex %>% filter(word %in% c("fun", "glad"))
## # A tibble: 2 x 2
## word score
## <chr> <int>
## 1 fun 4
## 2 glad 3
huck_afinn <- huck %>%
# Inner Join to AFINN lexicon
inner_join(afinn_lex, by = c("term" = "word")) %>%
# Count by score and line
count(score, line)
huck_afinn_agg <- huck_afinn %>%
# Group by line
group_by(line) %>%
# Sum scores by line
summarize(total_score = sum(score))
# Filter huck_afinn_agg
huck_afinn_agg %>% filter(line == 5400)
## # A tibble: 1 x 2
## line total_score
## <dbl> <int>
## 1 5400 7
# Plot total score vs. line
ggplot(huck_afinn_agg, aes(x=line, y=total_score)) +
# Add a smooth trend curve
geom_smooth()
## `geom_smooth()` using method = 'gam'
# Last but not least, you get to work with the NRC lexicon which labels words across multiple emotional states
# Remember Plutchik's wheel of emotion? The NRC lexicon tags words according to Plutchik's 8 emotions plus positive/negative
# In this exercise there is a new operator, %in%, which matches a vector to another
# In the code below %in% will return FALSE, FALSE, TRUE
# This is because within some_vec, 1 and 2 are not found within some_other_vector but 3 is found and returns TRUE
# The %in% is useful to find matches
# We've created oz which is the tidy version of The Wizard of Oz along with nrc containing the "NRC" lexicon with renamed columns
# Switched to Julius Caesar since it is what is easily available in the dataset
jc <- all_books[all_books$book=="julius_caesar", c("term", "document", "count")] %>%
mutate(document=as.numeric(document)) %>%
rename(line=document)
jc
## # A tibble: 13,165 x 3
## term line count
## <chr> <dbl> <dbl>
## 1 etext 1.00 1.00
## 2 file 1.00 1.00
## 3 gutenberg 1.00 1.00
## 4 ïthis 1.00 1.00
## 5 presented 1.00 1.00
## 6 project 1.00 1.00
## 7 cooperation 2.00 1.00
## 8 inc 2.00 1.00
## 9 library 2.00 2.00
## 10 world 2.00 1.00
## # ... with 13,155 more rows
# Join text and lexicon
jc_nrc <- inner_join(jc, nrc_lex, by = c("term" = "word"))
# DataFrame of tally
jc_plutchik <- jc_nrc %>%
# Only consider Plutchik sentiments
filter(!sentiment %in% c("positive", "negative")) %>%
# Group by sentiment
group_by(sentiment) %>%
# Get total count by sentiment
summarize(total_count = sum(count))
# Plot the counts
ggplot(jc_plutchik, aes(x = sentiment, y = total_count)) +
# Add a column geom
geom_col()
Chapter 3 - Visualizing Sentiment
Parlor trick or worthwhile?
Introduction using sentiment analysis:
Interpreting visualizations:
Example code includes:
# Sometimes you want to track sentiment over time
# For example, during an ad campaign you could track brand sentiment to see the campaign's effect
# You saw a few examples of this at the end of the last chapter
# In this exercise you'll recap the workflow for exploring sentiment over time using the novel Moby Dick
# One should expect that happy moments in the book would have more positive words than negative
# Conversely dark moments and sad endings should use more negative language
# You'll also see some tricks to make your sentiment time series more visually appealling
moby_polarity <- m_dick_tidy %>%
mutate(index = as.numeric(document)) %>%
# Inner join to the lexicon
inner_join(bing, by=c("term" = "word")) %>%
# Count by sentiment, index
count(sentiment, index) %>%
# Spread sentiments
tidyr::spread(sentiment, n, fill=0) %>%
mutate(
# Add polarity field
polarity = positive - negative,
# Add line number field
line_number = row_number()
)
# Plot
ggplot(moby_polarity, aes(x=line_number, y=polarity)) +
geom_smooth() +
geom_hline(yintercept = 0, color = "red") +
ggtitle("Moby Dick Chronological Polarity") +
ggthemes::theme_gdocs()
## `geom_smooth()` using method = 'gam'
# One of the easiest ways to explore data is with a frequency analysis
# Although not difficult, in sentiment analysis this simple method can be surprisingly illuminating
# Specifically, you will build a barplot. In this exercise you are once again working with moby and bing to construct your visual
# Inner join without renamed columns
moby_sents <- inner_join(m_dick_tidy, bing, by = c("term" = "word"))
# Tidy sentiment calculation
moby_tidy_sentiment <- moby_sents %>%
count(term, sentiment, wt = count) %>%
tidyr::spread(sentiment, n, fill = 0) %>%
mutate(polarity = positive - negative)
# Review
moby_tidy_sentiment
## # A tibble: 2,362 x 4
## term negative positive polarity
## <chr> <dbl> <dbl> <dbl>
## 1 abominable 3.00 0 -3.00
## 2 abominate 1.00 0 -1.00
## 3 abomination 1.00 0 -1.00
## 4 abound 0 3.00 3.00
## 5 abruptly 2.00 0 -2.00
## 6 absence 5.00 0 -5.00
## 7 absurd 3.00 0 -3.00
## 8 absurdly 1.00 0 -1.00
## 9 abundance 0 3.00 3.00
## 10 abundant 0 2.00 2.00
## # ... with 2,352 more rows
# Subset
moby_tidy_small <- moby_tidy_sentiment %>%
filter(abs(polarity) >= 50)
# Add polarity
moby_tidy_pol <- moby_tidy_small %>%
mutate(
pol = ifelse(polarity > 0, "positive", "negative")
)
# Plot
ggplot(
moby_tidy_pol,
aes(reorder(term, polarity), polarity, fill = pol)
) +
geom_bar(stat = "identity") +
ggtitle("Moby Dick: Sentiment Word Frequency") +
ggthemes::theme_gdocs() +
theme(axis.text.x = element_text(angle = 90, vjust = -0.1))
# Now that you have seen how polarity can be used to divide a corpus, let's do it!
# This code will walk you through dividing a corpus based on sentiment so you can peer into the informaton in subsets instead of holistically
# Your R session has oz_pol which was created by applying polarity() to "The Wonderful Wizard of Oz."
# For simplicity's sake, we created a simple custom function called pol_subsections() which will divide the corpus by polarity score
# First, the function accepts a data frame with each row being a sentence or document of the corpus
# The data frame is subset anywhere the polarity values are greater than or less than 0
# Finally, the positive and negative sentences, non-zero polarities, are pasted with parameter collapse so that the terms are grouped into a single corpus
# Lastly, the two documents are concatenated into a single vector of two distinct documents
pol_subsections <- function(df) {
x.pos <- subset(df$text, df$polarity > 0)
x.neg <- subset(df$text, df$polarity < 0)
x.pos <- paste(x.pos, collapse = " ")
x.neg <- paste(x.neg, collapse = " ")
all.terms <- c(x.pos, x.neg)
return(all.terms)
}
# At this point you have omitted the neutral sentences and want to focus on organizing the remaining text
# In this exercise we use the %>% operator again to forward objects to functions
# After some simple cleaning use comparison.cloud() to make the visual
# Using Agamemnon instead since easily available
ag_pol <- polarity(agRawText[318:3430])
## Warning in polarity(agRawText[318:3430]):
## Some rows contain double punctuation. Suggested use of `sentSplit` function.
# Add scores to each document line in a data frame
ag_df <- ag_pol$all %>%
select(text = text.var, polarity = polarity)
# Custom function
all_terms <- pol_subsections(ag_df)
# Make a corpus
all_corpus <- all_terms %>%
VectorSource() %>%
VCorpus()
# Basic TDM
all_tdm <- TermDocumentMatrix(
all_corpus,
control = list(
removePunctuation = TRUE,
stopwords = stopwords(kind = "en")
)
) %>%
as.matrix() %>%
set_colnames(c("positive", "negative"))
# Make a comparison cloud
wordcloud::comparison.cloud(
all_tdm,
max.words = 50,
colors = c("darkgreen", "darkred")
)
# In this exercise you go beyond subsetting on positive and negative language
# Instead you will subset text by each of the 8 emotions in Plutchik's emotional wheel to construct a visual
# With this approach you will get more clarity in word usage by mapping to a specific emotion instead of just positive or negative.
# Using the tidytext subjectivity lexicon, "nrc", you perform an inner_join() with your text
# The "nrc" lexicon has the 8 emotions plus positive and negative term classes
# So you will have to drop positive and negative words after performing your inner_join()
# One way to do so is with the negation, !, and grepl()
# The "Global Regular Expression Print Logical," grepl(), function will return a True or False if a string pattern is identified in each row
# In this exercise you will search for positive OR negative using the | operator, representing "or" as shown below
# Often this straight line is above the enter key on a keyboard
# Since the ! negation precedes grepl(), the T or F is switched so the "positive|negative" is dropped instead of kept
# Next you apply count() on the identified words along with spread() to get the data frame organized
# This exercise introduces rownames()
# This function declares the names of rows in a data frame
# It behaves a bit differently because rownames() is passed the object gaining the row names on the left side of <-
# On the right side the character vector of names is declared such as data_frame[, 1]. For instance:
# rownames(data_frame) <- vector_of_names
# After setting row names you will create a more varied comparison.cloud()
# NOTE - appears NRC is already converted to have 'term' rather than 'word'
# Inner join
moby_sentiment <- inner_join(m_dick_tidy, nrc_lex, by = c("term" = "word"))
# Drop positive or negative
moby_pos_neg <- moby_sentiment %>%
filter(!grepl("positive|negative", sentiment))
# Count terms by sentiment then spread
moby_tidy <- moby_pos_neg %>%
count(sentiment, term = term) %>%
tidyr::spread(sentiment, n, fill = 0) %>%
as.data.frame()
# Set row names
rownames(moby_tidy) <- moby_tidy[, 1]
# Drop terms column
moby_tidy[, 1] <- NULL
# Examine
head(moby_tidy)
## anger anticipation disgust fear joy sadness surprise trust
## abandon 0 0 0 3 0 3 0 0
## abandoned 7 0 0 7 0 7 0 0
## abandonment 2 0 0 2 0 2 2 0
## abhorrent 1 0 1 1 0 0 0 0
## abominable 0 0 3 3 0 0 0 0
## abomination 1 0 1 1 0 0 0 0
# Comparison cloud
wordcloud::comparison.cloud(moby_tidy, max.words = 50, title.size = 1.5)
# Another way to slice your text is to understand how much of the document(s) are made of positive or negative words
# For example a restaurant review may have some positive aspects such as "the food was good" but then continue to add "the restaurant was dirty, the staff was rude and parking was awful."
# As a result, you may want to understand how much of a document is dedicated to positive vs negative language
# In this example it would have a higher negative percentage compared to positive
# One method for doing so is to count() the positive and negative words then divide by the number of subjectivity words identified
# In the restaurant review example, "good" would count as 1 positive and "dirty," "rude," and "awful" count as 3 negative terms
# A simple calculation would lead you to believe the restaurant review is 25% positive and 75% negative since there were 4 subjectivity terms
# Start by performing the inner_join() on a unified tidy data frame containing 4 books, Agamemnon, Oz, Huck Finn, and Moby Dick
# Just like the previous exercise you will use filter() and grepl()
# To perform the count() you have to group the data by book and then sentiment
# For example all the positive words for Agamemnon have to be grouped then tallied so that positive words from all books are not mixed
# Luckily, you can pass multiple variables into count() directly
# Forward book_sents, which is the NRC inner join to all tidy books, to filter()
# Review tail of all_books
tail(all_books)
## # A tibble: 6 x 5
## term document count author book
## <chr> <chr> <dbl> <chr> <chr>
## 1 ebooks 19117 1.00 twain innocents_abroad
## 2 email 19117 1.00 twain innocents_abroad
## 3 hear 19117 1.00 twain innocents_abroad
## 4 new 19117 1.00 twain innocents_abroad
## 5 newsletter 19117 1.00 twain innocents_abroad
## 6 subscribe 19117 1.00 twain innocents_abroad
# Inner join
books_sents <- inner_join(all_books, nrc_lex, by=c("term"="word"))
# Keep only positive or negative
books_pos_neg <- books_sents %>%
filter(grepl("positive|negative", sentiment))
# Review tail again
tail(books_pos_neg)
## # A tibble: 6 x 6
## term document count author book sentiment
## <chr> <chr> <dbl> <chr> <chr> <chr>
## 1 included 19106 1.00 twain innocents_abroad positive
## 2 compliance 19107 1.00 twain innocents_abroad positive
## 3 main 19110 1.00 twain innocents_abroad positive
## 4 information 19114 1.00 twain innocents_abroad positive
## 5 including 19115 1.00 twain innocents_abroad positive
## 6 foundation 19116 1.00 twain innocents_abroad positive
# Count by book & sentiment
books_sent_count <- books_pos_neg %>%
count(book, sentiment)
# Review entire object
books_sent_count
## # A tibble: 22 x 3
## book sentiment n
## <chr> <chr> <int>
## 1 bartleby negative 537
## 2 bartleby positive 864
## 3 confidence_man negative 3561
## 4 confidence_man positive 5899
## 5 ct_yankee negative 4048
## 6 ct_yankee positive 6154
## 7 hamlet negative 1677
## 8 hamlet positive 2250
## 9 huck_finn negative 2471
## 10 huck_finn positive 3544
## # ... with 12 more rows
# Split, make proportional
book_pos <- books_sent_count %>%
group_by(book) %>%
mutate(percent_positive = n / sum(n) * 100)
# Proportional bar plot
ggplot(book_pos, aes(x = book, y = percent_positive, fill = sentiment)) +
geom_bar(stat = "identity")
# We've loaded ag as a tidy version of Agamemnon and created afinn as a subset of the tidytext "afinn" lexicon
# Agamemnon inner join
ag_afinn <- inner_join(ag_tidy, afinn_lex, by=c("term"="word")) %>%
mutate(line=as.numeric(document)) %>%
select(-document)
# Add book
ag_afinn$book <- "agamemnon"
# Oz inner join (use jc instead)
jc_afinn <- inner_join(jc, afinn_lex, by=c("term"="word"))
# Add book
jc_afinn$book <- "jc"
# Combine
all_df <- rbind(ag_afinn, jc_afinn)
# Plot 2 densities
ggplot(all_df, aes(x = score, fill = book)) +
geom_density(alpha = 0.3) +
ggthemes::theme_gdocs() +
ggtitle("AFINN Score Densities")
# In this exercise the all_book_polarity object is already loaded
# The data frame contains two columns, book and polarity
# It comprises all books with qdap's polarity() function applied
all_book_polarity <- readRDS("./RInputFiles/all_book_polarity.rds")
# Examine
str(all_book_polarity)
## 'data.frame': 14437 obs. of 2 variables:
## $ book : Factor w/ 4 levels "huck","agamemnon",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ polarity: num 0.277 0.258 -0.577 0.25 0.516 ...
# Summary by document
tapply(all_book_polarity$polarity, all_book_polarity$book, FUN=summary)
## $huck
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -1.38700 -0.25820 0.23570 0.04156 0.26730 1.60400
##
## $agamemnon
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -1.4670 -0.3780 -0.3333 -0.1266 0.3333 1.2250
##
## $moby
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -2.13300 -0.28870 -0.25000 -0.02524 0.28870 1.84800
##
## $oz
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -1.2730 -0.2774 0.2582 0.0454 0.2887 1.1880
# Box plot
ggplot(all_book_polarity, aes(x = book, y = polarity)) +
geom_boxplot(fill = c("#bada55", "#F00B42", "#F001ED", "#BA6E15"), col = "darkred") +
geom_jitter(position = position_jitter(width = 0.1, height = 0), alpha = 0.02) +
ggthemes::theme_gdocs() +
ggtitle("Book Polarity")
# Remember Plutchik's wheel of emotion?
# The NRC lexicon has the 8 emotions corresponding to the first ring of the wheel
# Previously you created a comparison.cloud() according to the 8 primary emotions
# Now you will create a radar chart similar to the wheel in this exercise
# A radarchart is a two-dimensional representation of multidimensional data (at least 3)
# In this case the tally of the different emotions for a book are represented in the chart
# Using a radar chart, you can review all 8 emotions simultaneously
# As before we've loaded the "nrc" lexicon as nrc and moby_huck which is a combined tidy version of both Moby Dick and Huck Finn
bindMoby <- m_dick_tidy %>%
mutate(document=as.numeric(document), book="moby")
bindHuck <- huck %>%
mutate(book="huck") %>%
rename(document=line)
moby_huck <- rbind(bindMoby, bindHuck)
moby_huck
## # A tibble: 165,194 x 4
## term document count book
## * <chr> <dbl> <dbl> <chr>
## 1 chapter 2.00 1.00 moby
## 2 loomings 2.00 1.00 moby
## 3 agonever 5.00 1.00 moby
## 4 call 5.00 1.00 moby
## 5 ishmael 5.00 1.00 moby
## 6 long 5.00 1.00 moby
## 7 mind 5.00 1.00 moby
## 8 preciselyhaving 5.00 1.00 moby
## 9 some 5.00 1.00 moby
## 10 years 5.00 1.00 moby
## # ... with 165,184 more rows
# Review tail of moby_huck
tail(moby_huck)
## # A tibble: 6 x 4
## term document count book
## <chr> <dbl> <dbl> <chr>
## 1 subscribe 11788 1.00 huck
## 2 ebooks 11789 1.00 huck
## 3 email 11789 1.00 huck
## 4 hear 11789 1.00 huck
## 5 new 11789 1.00 huck
## 6 newsletter 11789 1.00 huck
# Inner join
books_sents <- inner_join(moby_huck, nrc_lex, by=c("term"="word"))
# Drop positive or negative
books_pos_neg <- books_sents %>%
filter(!grepl("positive|negative", sentiment))
# Tidy tally
books_tally <- books_pos_neg %>%
group_by(book, sentiment) %>%
tally()
# Key value pairs
scores <- books_tally %>%
tidyr::spread(book, n)
# Review scores
scores
## # A tibble: 8 x 3
## sentiment huck moby
## <chr> <int> <int>
## 1 anger 1123 2811
## 2 anticipation 2214 4740
## 3 disgust 823 2025
## 4 fear 1332 4178
## 5 joy 1713 3175
## 6 sadness 1303 3393
## 7 surprise 1154 2153
## 8 trust 2191 5099
# JavaScript radar chart
radarchart::chartJSRadar(scores)
# Make the scores relatove to total
scoresRelative <- scores %>%
mutate(huckRel = huck/sum(huck), mobyRel=moby/sum(moby))
scoresRelative
## # A tibble: 8 x 5
## sentiment huck moby huckRel mobyRel
## <chr> <int> <int> <dbl> <dbl>
## 1 anger 1123 2811 0.0947 0.102
## 2 anticipation 2214 4740 0.187 0.172
## 3 disgust 823 2025 0.0694 0.0734
## 4 fear 1332 4178 0.112 0.152
## 5 joy 1713 3175 0.145 0.115
## 6 sadness 1303 3393 0.110 0.123
## 7 surprise 1154 2153 0.0974 0.0781
## 8 trust 2191 5099 0.185 0.185
# JavaScript radar chart
radarchart::chartJSRadar(scoresRelative[, c("sentiment", "huckRel", "mobyRel")])
# Often you will find yourself working with documents in groups, such as author, product or by company
# This exercise lets you learn about the text while retaining the groups in a compact visual
# For example, with customer reviews grouped by product you may want to explore multiple dimensions of the customer reviews at the same time
# First you could calculate the polarity() of the reviews. Another dimension may be length
# Document length can demonstrate the emotional intensity
# If a customer leaves a short "great shoes!" one could infer they are actually less enthusiastic compared to a lengthier positive review
# You may also want to group reviews by product type such as women's, men's and children's shoes. A treemap lets you examine all of these dimensions
# For text analysis, within a treemap each individual box represents a document such as a tweet
# Documents are grouped in some manner such as author
# The size of each box is determined by a numeric value such as number of words or letters
# The individual colors are determined by a sentiment score
# After you organize the tibble, you use the treemap library containing the function treemap() to make the visual
# The code example below declares the data, grouping variables, size, color and other aesthetics
# treemap(data_frame,
# index = c("group", "individual_document"),
# vSize = "V1",
# vColor = "avg_score",
# type = "value",
# title = "Book Sentiment Scores",
# palette = c("red", "white", "green"))
# The pre-loaded all_books object contains a combined tidy format corpus with 4 Shakespeare, 3 Melville and 4 Twain books
# Based on the treemap you should be able to tell who writes longer books, and the polarity of the author as a whole and for individual books
books_score <- all_books %>%
# Inner join with AFINN scores
inner_join(afinn_lex, by=c("term" = "word"))
book_length <- books_score %>%
# Count number of words per book
count(book)
book_score <- books_score %>%
# Group by author, book
group_by(author, book) %>%
# Calculate mean book score
summarize(mean_score = mean(score))
book_tree <- book_score %>%
# Inner join by book
inner_join(book_length, by=c("book"))
# Examine the results
book_tree
## # A tibble: 11 x 4
## # Groups: author [?]
## author book mean_score n
## <chr> <chr> <dbl> <int>
## 1 melville bartleby 0.101 761
## 2 melville confidence_man 0.506 5480
## 3 melville moby_dick 0.161 8973
## 4 shakespeare hamlet 0.0984 2064
## 5 shakespeare julius_caesar 0.0846 1359
## 6 shakespeare macbeth 0.222 910
## 7 shakespeare romeo_juliet 0.175 1978
## 8 twain ct_yankee 0.199 6083
## 9 twain huck_finn 0.0763 4849
## 10 twain innocents_abroad 0.405 8988
## 11 twain tom_sawyer -0.0265 3741
# Make the visual
treemap::treemap(book_tree,
index = c("author", "book"),
vSize = "n",
vColor = "mean_score",
type = "value",
title = "Book Sentiment Scores",
palette = c("red", "white", "green")
)
Chapter 4 - Case Study: Airbnb